deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Pinecone: dummy vector is not compatible with the new API #6931

Closed · anakin87 closed this issue 9 months ago

anakin87 commented 9 months ago

Discussed in https://github.com/deepset-ai/haystack/discussions/6929

Originally posted by **Boltzmann08** February 6, 2024

Hello everyone,

I am trying to upsert data to Pinecone. First I convert and preprocess the documents, but as soon as I try to write the preprocessed data I get an API error. I am running this on Colab.

```python
!pip install farm-haystack[all]
!pip install datasets

# import all the necessary libraries
from haystack.utils import fetch_archive_from_http, convert_files_to_docs
from haystack.nodes import PreProcessor

# download and convert the tutorial documents
doc_dir = "data/tutorial8"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial8.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
all_docs = convert_files_to_docs(dir_path=doc_dir)

# split the documents into 100-word chunks
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
docs_default = preprocessor.process(all_docs)

# the Documents hold their text in the 'content' key; write them to the
# already initialized Pinecone document store
document_store.write_documents(docs_default)
```

The error message is:

```
ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'Content-Length': '155', 'x-pinecone-request-latency-ms': '136', 'date': 'Wed, 31 Jan 2024 13:33:43 GMT', 'x-envoy-upstream-service-time': '32', 'server': 'envoy', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"code":3,"message":"Dense vectors must contain at least one non-zero value. Vector ID 1f6ca8a2bd6c9903813607120d8d48bc contains only zeros.","details":[]}
```

When I do this:

```python
from pprint import pprint
pprint(docs_default[0])
```

it returns this:

I found a workaround that creates the embeddings and upserts them directly to the Pinecone index without using Haystack, but it is a shame not to use everything Haystack provides. In that case the retriever is also unable to update the embeddings once connected to the document store, because the index still looks empty to it. After some reflection and searching, it seems Pinecone does not accept `'embedding': None` for a document, yet this is exactly what PreProcessor returns. Has anyone else run into this issue?
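For context, the rejected write can be reproduced with the Pinecone client alone: Haystack writes an all-zeros placeholder vector for documents that have no embedding yet, and the new Pinecone API rejects such vectors. Below is a minimal sketch of that behavior, assuming the new `pinecone` client (v3+); the index name, dimension, and dummy value are placeholders, not taken from the issue.

```python
# Minimal sketch of the underlying problem, assuming a v3/serverless Pinecone index.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("my-index")  # hypothetical index with dimension 768
dim = 768

# An all-zeros placeholder vector (like the dummy vector Haystack writes for
# documents without an embedding) is rejected with the 400 error shown above.
index.upsert(vectors=[{"id": "doc-without-embedding", "values": [0.0] * dim}])

# A vector with at least one non-zero component is accepted.
index.upsert(vectors=[{"id": "doc-with-dummy", "values": [1e-7] + [0.0] * (dim - 1)}])
```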
anakin87 commented 9 months ago

It seems the same problem emerged in https://github.com/deepset-ai/haystack-core-integrations/issues/300

anakin87 commented 9 months ago

Fixed in #6932.
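The fix presumably replaces the all-zeros dummy vector with one containing a small non-zero value, so that documents written without an embedding can still be upserted under the new API. A minimal sketch of that idea follows; the names and the exact value are assumptions for illustration, not the actual patch in #6932.

```python
# Illustrative sketch only: a placeholder embedding with a tiny non-zero value
# passes Pinecone's "at least one non-zero value" check. The constant name and
# value are assumptions, not copied from PR #6932.
DUMMY_VALUE = 1e-10

def dummy_vector(dimension: int) -> list[float]:
    """Placeholder embedding for documents that have no real embedding yet."""
    return [DUMMY_VALUE] * dimension
```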