deepset-ai / haystack

:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Faiss Document Store combined with DPR extremely slow #535

Closed · maxupp closed this issue 3 years ago

maxupp commented 3 years ago

Question: I have previously used the In-Memory Document Store with TF-IDF retrieval, which was pretty quick to set up. Now I have moved to the FAISS Document Store combined with Dense Passage Retrieval.

Issue 1: I have downloaded the models via the Transformers CLI in order to supply them locally. When initializing from these weights, I get a warning that not all of the weights were initialized from the pretrained model and that some will be randomly initialized. Is that currently normal behavior? And can I persist the model Haystack pulls from remote?

Issue 2: Startup and embedding of the docs is extremely slow. After the "Got ya 31 parallel workers ..." log message, it takes about another hour before it starts printing the "Embedded x / x documents" messages. It then takes another 3 hours to embed the roughly 150,000 text snippets I use. And that is on a GTX 1080 Ti in the official Haystack GPU container. I am aware that embedding is a time-consuming operation, but based on my previous experience this still seems a bit slow.

Additional context: This is on 0.4.0 in the haystack-gpu Docker container.

Code:

    import pickle

    # Haystack imports (module paths as of roughly the 0.5.x releases; older versions used different paths)
    from haystack.document_store.faiss import FAISSDocumentStore
    from haystack.retriever.dense import DensePassageRetriever

    # Create the FAISS-backed document store and load the pickled documents
    document_store = FAISSDocumentStore()

    with open(documents_pickle, 'rb') as fin:
        docs = pickle.load(fin)

    document_store.write_documents(docs)

    # DPR retriever initialized from locally downloaded query/passage encoders
    retriever = DensePassageRetriever(
        document_store=document_store,
        query_embedding_model="./retriever_model/query",
        passage_embedding_model="./retriever_model/passage",
    )

    # Compute and store dense embeddings for all documents
    document_store.update_embeddings(retriever)

Content of the passage DPR model dir (see attached screenshot); similar for the query embedding model.
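For reference, a minimal sketch of the speed-related knobs on the retriever side. Depending on the Haystack release, DensePassageRetriever exposes parameters roughly like the ones below; the exact names use_gpu and batch_size are assumptions here and may differ between versions.

    # Hypothetical parameter names; check the DensePassageRetriever signature of your version
    retriever = DensePassageRetriever(
        document_store=document_store,
        query_embedding_model="./retriever_model/query",
        passage_embedding_model="./retriever_model/passage",
        use_gpu=True,    # make sure embedding actually runs on the GPU
        batch_size=16,   # larger batches can speed up embedding if GPU memory allows
    )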

tholor commented 3 years ago

Hey @maxupp ,

Issue 1: I have downloaded the models via Transformers CLI in order to supply them locally. When initializing from these weights, I get a warning that not all of the weights have been initialized from the pretrained model, and that some will be randomly initialized. Is that currently normal behavior?

Can you please share the exact error message? Do you still get the error message with the latest version from the master branch?

And can I persist the model Haystack pulls from remote?

Do you mean the DPR models? If you supply local paths (as in your code snippet above), it should not pull anything else from remote :thinking:

Issue 2: [...] this still seems a bit slow to me

Yes, this is very slow. In our benchmarks we measured ~107 docs/sec, which roughly equals 385k docs/hour. That is on a V100 GPU, and while I haven't used a 1080 Ti myself, I would not expect a >10x difference. Can you please test whether you still get these numbers with the latest master branch? We did quite a few refactorings of DPR + FAISS recently and removed some bottlenecks in the SQLDocumentStore there.
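A quick back-of-the-envelope check of the gap, using only the numbers reported in this thread (~107 docs/sec on the V100 vs. roughly 3 hours for 145,898 docs on the 1080 Ti):

    # Rough throughput comparison based on the figures above
    benchmark_docs_per_sec = 107                  # V100 benchmark: ~107 docs/s, i.e. ~385k docs/h
    observed_docs_per_sec = 145_898 / (3 * 3600)  # ~13.5 docs/s reported on the GTX 1080 Ti
    print(benchmark_docs_per_sec / observed_docs_per_sec)  # roughly 8x slower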

maxupp commented 3 years ago

This is what the initialization yields:

    [muppenkamp@gb40-ds01 ~]$ docker logs qa_qa_backend_1
    Loading faiss with AVX2 support.
    Loading faiss.
    device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
    Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {"training": false, "num_labels": 2, "ph_output_type": "per_token_squad", "model_type": "span_classification", "config": {"training": true, "layer_dims": [768, 2], "num_labels": 2, "ph_output_type": "per_token_squad", "model_type": "span_classification", "task_name": "question_answering", "no_ans_boost": 0.0, "context_window_size": 100, "n_best": 5, "n_best_per_sample": 1, "name": "QuestionAnsweringHead"}, "label_tensor_name": "question_answering_label_ids", "label_list": ["start_token", "end_token"], "metric": "squad", "name": "QuestionAnsweringHead"}
    device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
    Got ya 31 parallel workers to do inference ...
     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    /w\  /w\  /w\  /w\  /w\  /w\  /w\  /|\  /w\  /w\  /w\  /w\  /w\  /w\  /|\  /w\  /|\  /|\  /|\  /|\  /w\  /w\  /|\  /|\  /w\  /w\  /w\  /w\  /w\  /w\  /|\
    /'\  / \  /'\  /'\  / \  / \  /'\  /'\  /'\  /'\  /'\  /'\  / \  /'\  /'\  / \  /'\  /'\  /'\  /'\  / \  / \  /'\  /'\  / \  / \  / \  /'\  /'\  /'\  /'\
    Some weights of DPRContextEncoder were not initialized from the model checkpoint at ./retriever_model/passage and are newly initialized: ['ctx_encoder.bert_model.embeddings.position_ids']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Updating embeddings for 145898 docs ...
    Embedded 80 / 145898 texts

maxupp commented 3 years ago

And can I persist the model Haystack pulls from remote?

Do you mean the DPR models? If you supply local paths (as in your code snippet above), it should not pull anything else from remote 🤔

Nothing is pulled when I point to my local models. I asked whether I can persist the automatically retrieved models in some way, because I'm not sure my CLI-downloaded models match the ones Haystack pulls when no local path is supplied.
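One way to get local copies that match the remote defaults is to download and save the public DPR checkpoints with the transformers library directly. A minimal sketch, assuming the facebook/dpr-*-single-nq-base checkpoints are the ones Haystack pulls by default (worth confirming for your version):

    from transformers import (
        DPRContextEncoder, DPRContextEncoderTokenizer,
        DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
    )

    # Download the public DPR checkpoints and persist them locally
    checkpoints = [
        (DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
         "facebook/dpr-question_encoder-single-nq-base", "./retriever_model/query"),
        (DPRContextEncoder, DPRContextEncoderTokenizer,
         "facebook/dpr-ctx_encoder-single-nq-base", "./retriever_model/passage"),
    ]
    for model_cls, tokenizer_cls, name, target_dir in checkpoints:
        model_cls.from_pretrained(name).save_pretrained(target_dir)
        tokenizer_cls.from_pretrained(name).save_pretrained(target_dir)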

maxupp commented 3 years ago

Unfortunately I'm not able to test on the master branch right now, because I'm using your prebuilt Docker images, and as far as I can see, those aren't provided for the current master.

Unfortunately I don't have the time to build the container myself right now.

tholor commented 3 years ago

Ok got it.

Some weights of DPRContextEncoder were not initialized [...] ['ctx_encoder.bert_model.embeddings.position_ids']

Not a problem, and it should be gone in the latest master.

Unfortunately I don't have the time to build the container myself right now.

Sure. We will do the next Haystack release very soon anyway. Then you can try it with the new image.

I asked whether I can persist the automatically retrieved models in some way

There's a hackish way in the old 0.4.0 version, but I would rather suggest waiting for 0.5.0, where we'll have a clean save() / load() method for DPR.

tholor commented 3 years ago

@maxupp We've released Haystack 0.5.0 today and pushed new Docker images. Let me know if the performance problem still persists there ...

BTW: save() and load() got implemented in #542
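For anyone landing here later, usage looks roughly like this; the exact signatures are a sketch and may differ slightly between Haystack releases:

    # Persist an initialized DPR retriever and reload it against a document store
    retriever.save("./saved_dpr")
    retriever = DensePassageRetriever.load("./saved_dpr", document_store=document_store)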

maxupp commented 3 years ago

Haven't gotten to the performance issue yet, still fixing what broke when upgrading Haystack.

The first thing I found is that the FARM Reader's default num_processes of None immediately leads to an error in the multiprocessing lib:


File "/usr/lib/python3.7/multiprocessing/pool.py", line 169, in __init__
qa_backend_1   |     raise ValueError("Number of processes must be at least 1")
qa_backend_1   | ValueError: Number of processes must be at least 1

I will keep you updated on my further encounters.

tholor commented 3 years ago

@maxupp I tried to reproduce your error but did not succeed :thinking: . I am not aware of any change in FARM's usage of multiprocessing that could explain this. The only lead I see here: are you running this on a machine with a single CPU core?

The default behaviour in FARM for num_processes=None is to use multiprocessing.cpu_count() - 1 processes. So maybe you end up spawning 0 processes here? Maybe we should change this to min(multiprocessing.cpu_count() - 1, 1).

However, this behaviour should not have changed with Haystack 0.5.0.

Happy to support on the debugging here or in case you face other problems with the latest version!

lalitpagaria commented 3 years ago

I think you mean max(multiprocessing.cpu_count() - 1, 1) instead of min.

tholor commented 3 years ago

yes, of course :)

tholor commented 3 years ago

In the latest FARM (0.6.0), we avoid setting num_processes = 0 in the case of a single CPU core (see https://github.com/deepset-ai/FARM/blob/1f046cadf9e5090d93a9760593b9788e1fe012c1/farm/infer.py#L323). It will be available in Haystack once we upgrade to the latest FARM version (probably within the next week).
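The fix amounts to clamping the worker count so it never drops to zero on single-core machines, along these lines:

    import multiprocessing as mp

    # Use all cores but one, and never fewer than one process
    num_processes = max(mp.cpu_count() - 1, 1)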

As I could not reproduce the slow speed here, I am closing this issue now. Feel free to re-open if the problem remains on your side with the latest Haystack version.