deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Reopen Issue #1019 #5407

Closed 4ut0m8NT closed 4 months ago

4ut0m8NT commented 1 year ago

Describe the bug Loading an existing FAISS document store from a saved index/config no longer works in 1.18.1.

It will run once. Work, perform Q/A. Reload = FAIL.

Error message ValueError: The number of documents in the SQL database (96) doesn't match the number of embeddings in FAISS (0). Make sure your FAISS configuration file points to the same database that you used when you saved the original index.

Expected behavior Q/A App Loads and works just like first run.

Additional context Test Doc = converted PDF.

PreProcessing:

from haystack.nodes import PDFToTextConverter, PreProcessor

converter = PDFToTextConverter(remove_numeric_tables=True)

doc_pdf = converter.convert(file_path="data/preprocessing_tutorial/bert.pdf", meta=None)

doc = converter.convert(file_path=filename, meta={'name': str(filename)})

# Split the converted text into ~200-word chunks on sentence boundaries
processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=200,
    split_respect_sentence_boundary=True,
    split_overlap=0
)
docs = processor.process(doc)
print(docs)
document_store.write_documents(docs)
# Save with custom index and config paths
document_store.save(index_path="./faissshift.index", config_path="./faiss.json")
# Second save, to see if the documented example worked better... :(
document_store.save("my_faiss")

To Reproduce Use farm-haystack 1.18.1

Run an EmbeddingRetriever with a 384-dimensional embedding model.

Attempt to reload a 2nd time.

FAQ Check

System:
OS: Ubuntu
GPU/CPU: GPU
Haystack version (commit or version number): 1.18.1
DocumentStore: FAISSDocumentStore
Reader: deepset/deberta-v3-base-injection
Retriever: EmbeddingRetriever - sentence-transformers/all-MiniLM-L6-v2 (requires 384 dim)

my_faiss.json: {"faiss_index_factory_str": "Flat", "embedding_dim": 384, "index": "documents", "similarity": "cosine", "embedding_field": "question_emb", "sql_url": "sqlite:///faiss_document_store.db"}

my_faiss (index) (binary): "IxFI�^A^@^@^@^@^@^@^@^@^@^@^@^@^P^@^@^@^@^@^@^@^P^@^@^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@"
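
The error message asks whether the config file points to the same database that was used when the index was saved. One way to check the SQL side is sketched below, using only the Python standard library. It reads `sql_url` from the config and counts rows in the SQLite file. The helper name `count_sql_documents` is hypothetical, and the table name `document` is an assumption about how the SQL store lays out its data, so adjust it if your schema differs.

```python
import json
import os
import sqlite3


def count_sql_documents(config_path, table="document"):
    """Return (sqlite_path, row_count) for the database named in a FAISS config.

    Hypothetical helper; `table` is an assumed table name, not confirmed
    by the thread.
    """
    with open(config_path, encoding="utf-8") as fh:
        config = json.load(fh)

    sql_url = config["sql_url"]
    prefix = "sqlite:///"
    if not sql_url.startswith(prefix):
        raise ValueError(f"Only SQLite URLs are handled in this sketch: {sql_url}")

    # "sqlite:///faiss_document_store.db" -> "faiss_document_store.db"
    # (a relative path, resolved against the current working directory)
    db_path = sql_url[len(prefix):]
    if not os.path.exists(db_path):
        raise FileNotFoundError(f"Database file not found: {db_path}")

    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, hence the f-string.
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    finally:
        conn.close()
    return db_path, count
```

Calling `count_sql_documents("my_faiss.json")` from the directory where the app runs shows which database file the config resolves to and how many rows it holds. Because the path in the config above is relative, running the app from a different working directory would resolve to a different file.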

Please advise.

Also added to closed ticket #1019.

anakin87 commented 1 year ago

Hello @4ut0m8NT, we usually don't monitor closed issues.

Does this help? https://github.com/deepset-ai/haystack/issues/3961#issuecomment-1406213631

4ut0m8NT commented 1 year ago

Thanks @anakin87, but this isn't a syntax issue:

document_store = FAISSDocumentStore.load(index_path="my_faiss", config_path="my_faiss.json")

It produces the "ValueError: The number of documents in the SQL database (96) doesn't ..." error if the DB or index already exists.

Please advise.

4ut0m8NT commented 1 year ago

document_store = FAISSDocumentStore(faiss_config_path="./my_faiss.json", faiss_index_path="./my_faiss")

Also a Fail. Please advise.

demongolem-biz2 commented 11 months ago

Yes, I get this as well. If I blow away the index and config files, the FAISS DocumentStore works just fine. However, the save and load process no longer works.

demongolem-biz2 commented 11 months ago

OK, so I think the tutorial I was following at https://haystack.deepset.ai/integrations/faiss-document-store, which uses FAISS to perform semantic search, needs to be updated, because it does not show how to save the DocumentStore. I was calling save(), but I never called update_embeddings(), which was the crucial part I was missing. You have to call update_embeddings() first and save() second so that the document and embedding counts match when the store is saved.

The tutorial has two parts: the indexing pipeline followed by the query pipeline. The indexing pipeline sets up the FAISSDocumentStore and indexes the documents. update_embeddings() needs to be called after indexing is complete and before the query pipeline runs. I expected it to happen during the indexing pipeline, but it is only after the EmbeddingRetriever has been created as part of the query pipeline that update_embeddings() is run and save() is performed. For normal usage you would want to save the store rather than rerun this code over and over, which is why this process should be mentioned in the tutorial.
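
The order described above (write, then update embeddings, then save) can be sketched as a small helper. This is a sketch under assumptions, not the tutorial's code. The function names `index_embed_save` and `run_with_haystack` are hypothetical. The store settings mirror the config and retriever model quoted earlier in this thread. The Haystack imports are done lazily inside the second function so the first one can be exercised without farm-haystack installed.

```python
def index_embed_save(document_store, retriever, docs, index_path, config_path):
    """Write documents, compute embeddings, then save, in that order."""
    document_store.write_documents(docs)
    # Without this step the FAISS index holds no vectors while the SQL
    # database already holds the documents, so the counts differ on reload.
    document_store.update_embeddings(retriever)
    document_store.save(index_path=index_path, config_path=config_path)


def run_with_haystack(docs):
    """Example wiring against farm-haystack 1.x (assumed to be installed)."""
    from haystack.document_stores import FAISSDocumentStore
    from haystack.nodes import EmbeddingRetriever

    document_store = FAISSDocumentStore(
        sql_url="sqlite:///faiss_document_store.db",
        faiss_index_factory_str="Flat",
        embedding_dim=384,  # matches all-MiniLM-L6-v2
        similarity="cosine",
    )
    retriever = EmbeddingRetriever(
        document_store=document_store,
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    )
    index_embed_save(document_store, retriever, docs, "my_faiss", "my_faiss.json")

    # In a later session, reload from the saved index and config.
    return FAISSDocumentStore.load(index_path="my_faiss", config_path="my_faiss.json")
```

Passing the store and retriever in as arguments keeps the ordering in one place, so the same helper works whether the retriever is created in the indexing script or in the query script.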

augchan42 commented 8 months ago

Initializing a FAISSDocumentStore can take 'faiss_index' and can also take 'index'. When initializing with 'index', I also got the mismatched count error. I checked the code, and the index param is ignored. So it seems there is an issue with the docs, and the parameter naming is confusing.