deepset-ai / haystack

:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Updating Embeddings for Faiss DPR on Large Dataset (Batchmode) #601

Closed: vinchg closed this issue 3 years ago

vinchg commented 3 years ago

Question

As mentioned in the title, I'm trying to update the embeddings for FAISS (HNSW) with a DPR retriever on a large dataset (~15 million documents). Following the tutorial steps, I'm writing the documents to a local SQLite DB (~15 GB .db file). However, calling update_embeddings uses all my RAM (64 GB) and all my swap (64 GB), then runs out of memory after several hours. I'm fairly certain the line that's consuming so much memory is: https://github.com/deepset-ai/haystack/blob/3f81c93f36519ab78213f145c699ce7df2c4ddf8/haystack/document_store/faiss.py#L158

More specifically: https://github.com/deepset-ai/haystack/blob/3f81c93f36519ab78213f145c699ce7df2c4ddf8/haystack/document_store/sql.py#L124 The call to query.all() is what's causing the issue.

I'm not very familiar with SQLAlchemy or SQLite in general, but it seems there is some inefficient use of memory when querying the DB. Is there an alternative way of dealing with large datasets like this? It would be preferable if there were a batch option for update_embeddings so that a group of embeddings could be flushed to disk before proceeding; a rough sketch of the access pattern I have in mind is below.
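To make that concrete, here is a minimal sketch using SQLAlchemy's streaming options instead of query.all(), just to illustrate the access pattern; the table and column names are assumptions on my part, not Haystack's actual schema:

from sqlalchemy import create_engine, text

# Illustration only: stream rows in fixed-size chunks instead of loading
# everything with query.all(), so only one chunk is held in memory at a time.
engine = create_engine("sqlite:///haystack.db")
batch_size = 10_000

with engine.connect() as conn:
    result = conn.execution_options(stream_results=True).execute(
        text("SELECT id, text FROM document")  # hypothetical table/columns
    )
    while True:
        rows = result.fetchmany(batch_size)
        if not rows:
            break
        # compute the embeddings for this chunk and flush them to disk here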

On a side note, some type of progress indicator, or an option to enable one, when calling write_documents or update_embeddings would be useful (considering these operations can take an hour or more).

tholor commented 3 years ago

Hey @vinchg ,

Thanks for raising this issue. It's highly relevant.

So in general: for larger datasets like this, we would recommend not using SQLite but rather Postgres. You can easily spin one up via Docker:

docker run --name haystack-postgres -p 5432:5432 -e POSTGRES_PASSWORD=password -d postgres
docker exec -it haystack-postgres psql -U postgres -c "CREATE DATABASE haystack;"

and then connect in Python via:

document_store = FAISSDocumentStore(sql_url="postgresql://postgres:password@localhost:5432/haystack",
                                    faiss_index_factory_str=index_type)
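The rest of the indexing flow then stays the same as in the DPR tutorial. Rough sketch only; the retriever arguments may differ slightly depending on your Haystack version, so please double-check against the tutorial:

from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=True)

document_store.write_documents(dicts)        # your documents as dicts or Document objects
document_store.update_embeddings(retriever)  # this is the step that currently eats memory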

However, switching the SQL backend will probably only increase memory efficiency and not resolve the underlying problem. I totally agree that we should add some batch functionality to update_embeddings and a tqdm progress bar.

The call to query.all() is what's causing the issue.

Do you already have any embeddings in the document store, or is it a fresh one?

Is there an alternative way of dealing with large datasets like this?

As a temporary workaround, you could generate the embeddings in batches yourself and attach them to your documents before calling write_documents() on the DocumentStore. Rough sketch:

from haystack import Document
dicts = [{"text": "some text"}, ...]
docs = [Document.from_dict(d) for d in dicts]

# get the embedding for a batch of docs
batch_docs = docs[:32]
batch_emb = retriever.embed_passages(batch_docs)

# attach embeddings to the docs
for emb, doc in zip(batch_emb, batch_docs):
    doc.embedding = emb
...

# later: write everything to the doc store
doc_store.write_documents(all_docs)
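Expanding that into a full (untested) loop over all documents, with a tqdm progress bar; the batch size is just a placeholder:

from tqdm import tqdm

batch_size = 32
all_docs = []
# embed the documents in fixed-size batches and attach the embeddings
for i in tqdm(range(0, len(docs), batch_size), desc="Embedding passages"):
    batch_docs = docs[i:i + batch_size]
    batch_emb = retriever.embed_passages(batch_docs)
    for emb, doc in zip(batch_emb, batch_docs):
        doc.embedding = emb
    all_docs.extend(batch_docs)

doc_store.write_documents(all_docs)

If memory is still tight, you could also call write_documents() once per batch instead of collecting everything in all_docs first.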

Again, we should in any case add batch functionality to update_embeddings() soon to simplify usage here ...

@lalitpagaria Would this maybe be something of interest to you that you would like to work on? If not, @tanaysoni can take care of it.

lalitpagaria commented 3 years ago

@tholor I would like to work on it, but only after a week. So if it is not something urgent, then I can take it up.

tholor commented 3 years ago

Ok awesome! That sounds perfectly fine. Thank you @lalitpagaria :)

vinchg commented 3 years ago

Thank you for the response. I'll give Postgres a try.

Do you already have any embeddings in the document store, or is it a fresh one?

It is a fresh one.

I want to mention another issue I found, unrelated to the DB. I pruned my dataset down to 1.3 million documents and reran with the original SQLite configuration. The call to query.all() resolves quickly, but it then loops (for 15+ hours) here: https://github.com/deepset-ai/FARM/blob/f8660466d5b78db8cb91603ef88d5988a12956a1/farm/data_handler/processor.py#L338

Originating from: https://github.com/deepset-ai/FARM/blob/f8660466d5b78db8cb91603ef88d5988a12956a1/farm/data_handler/processor.py#L415

I added tqdm and it's giving me a ~40-hour estimate (around 9 seconds per iteration) just to tokenize the dataset. I am using facebook/dpr-ctx_encoder-single-nq-base as the passage embedding model with a max seq len of 256. For reference, I'm working with a 10900F and a 3090.

The issue is here: https://github.com/deepset-ai/FARM/blob/f8660466d5b78db8cb91603ef88d5988a12956a1/farm/data_handler/processor.py#L2001

Edit: for comparison, I tried tokenizing the same dataset outside of the API and it took 17 minutes; a sketch of that comparison is below. I'm currently working around it by calculating my own embeddings and adding them to the docs, but this might also be something to look into.
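The standalone comparison looked roughly like this (my own quick script, not Haystack/FARM code; the chunk size is arbitrary):

from transformers import AutoTokenizer

# Batch-tokenize the passages with the fast tokenizer for the same DPR
# context encoder checkpoint and max sequence length.
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base", use_fast=True
)

texts = ["..."]  # in practice: the ~1.3M passage strings from my dataset
chunk_size = 10_000
for i in range(0, len(texts), chunk_size):
    encoded = tokenizer(
        texts[i:i + chunk_size],
        max_length=256,
        truncation=True,
        padding="max_length",
    )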

tholor commented 3 years ago

Hey @vinchg , The speed issue that you mention seems related to #602. We will investigate and optimize it (probably via batching and/or multiprocessing)!

tholor commented 3 years ago

I want to mention another issue I found, unrelated to the DB. I pruned my dataset down to 1.3 million documents and reran with the original SQLite configuration. The call to query.all() resolves quickly, but it then loops (for 15+ hours) here: https://github.com/deepset-ai/FARM/blob/f8660466d5b78db8cb91603ef88d5988a12956a1/farm/data_handler/processor.py#L338
Originating from: https://github.com/deepset-ai/FARM/blob/f8660466d5b78db8cb91603ef88d5988a12956a1/farm/data_handler/processor.py#L415

@vinchg Not sure if you saw this, but we fixed this one in https://github.com/deepset-ai/FARM/pull/638. A loop with O(n²) was causing the trouble...

Memory efficiency should soon be improved by #620, and afterwards we'll also introduce a batch mode for update_embeddings()...

brandenchan commented 3 years ago

Hi @vinchg, the new PR #733 should significantly improve the memory efficiency of update_embeddings(). Could you give it a try and let us know if it helps?

vinchg commented 3 years ago

Thank you guys! I'm a bit busy with other things at the moment, so I'm not sure when I'll get around to testing, but when I do, I'll post my results.