deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Query time takes too long inferencing samples #2331

Closed HighDeFing closed 2 years ago

HighDeFing commented 2 years ago

This is a reference to an issue I raised yesterday and got a response to. Originally posted by @ZanSara in https://github.com/deepset-ai/haystack/issues/2060#issuecomment-1072164793 .

Behavior:

When doing extractive QA with dense passage retrieval, it takes up to 15 minutes to give an answer.

I expected it to take only a few seconds or a couple of minutes.

Context

I'm using a Python class to wrap Haystack, using dense passage retrieval in Spanish. Its initialization is described here:

# Imports added for completeness (Haystack 1.x layout)
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import DensePassageRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

class Haystack_module():
    def __init__(self):
        self.document_store = ElasticsearchDocumentStore(similarity="dot_product")
        self.retriever = DensePassageRetriever(
            document_store=self.document_store,
            query_embedding_model="voidful/dpr-question_encoder-bert-base-multilingual",
            passage_embedding_model="sadakmed/dpr-passage_encoder-spanish",
            use_gpu=False
        )
        self.reader = FARMReader("mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es", use_gpu=False)
        self.qa_pipe = ExtractiveQAPipeline(reader=self.reader, retriever=self.retriever)

I'm using Elasticsearch as the document store. This is how I write the files into it:

converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["es"])
docs = converter.convert(file_path=file_source, meta=meta_data)

# [...] (some preprocessing, such as deleting '\n')
self.document_store.write_documents(docs)
self.document_store.update_embeddings(self.retriever)

Query

To get the answer I'm doing this inside the class:

def init_QAPipeline(self):
    self.qa_pipe = ExtractiveQAPipeline(reader=self.reader, retriever=self.retriever)

Then I just call the qa_pipe from the class and run a query like this:

query = '¿Qué es un adolescente?'
result = elastic_pipe.run(query=query, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}})
print(result)

System

I'm using a virtual machine with Windows 11 as the host and Ubuntu 20.04 as the guest. The specs for the guest are:

Any help is extremely welcome. Thank you.

TuanaCelik commented 2 years ago

Hi @HighDeFing - thank you for sending these over. I am going to have a look and try to help you out here. First off, thank you for the code snippets above, but if you can, could you send over a fully reproducible example? Maybe as a single file, and if possible a link to the documents you're writing to the docstore, or some specs about them?

HighDeFing commented 2 years ago

Yes, this is my project repository: https://github.com/HighDeFing/thesis_v4. To run it:

  1. python3 -m venv env
  2. source env/bin/activate
  3. env/bin/python3 setup.py install
  4. pip install -e .
  5. pip install -r requirements.txt (some requirements might be wrong, apologies)
  6. pip install Unidecode

All the files are in thesis_pdfs. As of now, I think you can run this code as follows.

Uploading the files into Elasticsearch

  1. Start elastic search with
    docker network create elastic
    docker pull docker.elastic.co/elasticsearch/elasticsearch:7.13.0
    docker run --name es01-test --net elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.13.0
  2. To run docker again: docker start es01-test -a
  3. Execute this Python script: https://github.com/HighDeFing/thesis_v4/blob/main/scripts/haystack_files/haystack_upload_files.py

$ (env) /thesis_v4: env/bin/python3 haystack_upload_files.py

After the files are uploaded I use this interface to make queries:

Right now it works with a predetermined query:

To run:

  1. Go to the fast_api folder here: https://github.com/HighDeFing/thesis_v4/blob/main/scripts/fastapi/main.py
  2. Execute uvicorn main:app --reload (you might need to install uvicorn)
  3. The website URL is http://127.0.0.1:8000/search.html

Click "buscar" (search) and it should give you a response.

Hope this helps. If you have any questions, please feel free to ask.

ZanSara commented 2 years ago

Hey @HighDeFing, sorry, but that's way too much code for us to parse. What @TuanaCelik was asking for was a Minimal Reproducible Example: please read this great description of it by the StackOverflow people to understand a bit better what it is exactly: https://stackoverflow.com/help/minimal-reproducible-example

We understand that it takes time to put together such an example, but the advantage is that in the process of making it, many other issues might come up and solve themselves. Minimal reproducible examples are a great debugging tool in their own right :slightly_smiling_face:

Unfortunately we can't really help you further otherwise. From your machine specs, you should be able to get answers within 1-2 minutes maximum. The only thing that might be taking so long is if you're re-indexing all your documents at every query, or some similar bug, but without a simple, single-file example of the issue it's impossible to tell. I hope you understand.
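
To give you an idea, a minimal example for a setup like yours could be a single script along these lines (just a rough sketch, assuming Haystack 1.x imports, an Elasticsearch instance on localhost:9200 and a single sample PDF; the model names are the ones from your snippets, the file path is made up):

# Rough sketch of a minimal reproducible example (Haystack 1.x layout assumed).
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import PDFToTextConverter, DensePassageRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

document_store = ElasticsearchDocumentStore(similarity="dot_product")

# Convert and index one sample document.
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["es"])
docs = converter.convert(file_path="sample.pdf", meta=None)
document_store.write_documents(docs)

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="voidful/dpr-question_encoder-bert-base-multilingual",
    passage_embedding_model="sadakmed/dpr-passage_encoder-spanish",
    use_gpu=False,
)
document_store.update_embeddings(retriever)

reader = FARMReader("mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es", use_gpu=False)

# Build the pipeline once and reuse it for every query.
pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
result = pipe.run(query="¿Qué es un adolescente?",
                  params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}})
print(result)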

HighDeFing commented 2 years ago

@ZanSara Oh okay, I will get back to you then as soon as I can. Thank you!

TuanaCelik commented 2 years ago

@HighDeFing In the meantime we've released a new version of Haystack (1.3.0). So you could try upgrading your Haystack package and see if that helps.

HighDeFing commented 2 years ago

I made a reproducible example here using version 1.3.0 of Haystack. It's a more compact version of my code: https://github.com/HighDeFing/repro_haystack. Every instruction on how to run it is in the readme.md.

I still encounter the same issue that queries take too long. Indexing the documents took me like 1.5 hours. I use Elasticsearch version: 7.13.0 in a docker container.

Please tell me if I should use fewer files for the example, and feel free to ask any questions. I really appreciate the help. Thank you.

ZanSara commented 2 years ago

Thanks, I'll look into it.

First off I noticed that you're using a rather large model. Have you tried using https://huggingface.co/mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es instead? Does the query time go down, and by how much?

You could also try to use "voidful/dpr-ctx_encoder-bert-base-multilingual" as your passage embedding model.
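
Roughly, the swap I have in mind would look like this in your snippet (untested sketch, everything else stays the same):

retriever = DensePassageRetriever(
    document_store=self.document_store,
    query_embedding_model="voidful/dpr-question_encoder-bert-base-multilingual",
    passage_embedding_model="voidful/dpr-ctx_encoder-bert-base-multilingual",  # multilingual context encoder
    use_gpu=False
)
reader = FARMReader("mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es", use_gpu=False)  # distilled reader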

ZanSara commented 2 years ago

You also don't seem to reuse the pipeline in your test. Have you made sure that in your bigger project the pipeline is reused?

Another thing you could try is to check how long the documents in the document store are. Use document_store.get_all_documents() with a filter to retrieve just a few, and make sure they contain only a few sentences each.
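
For example, something along these lines should show whether each stored document still holds a whole thesis (just a sketch, assuming Haystack 1.x Document objects; it only peeks at a handful of documents):

from itertools import islice

# Print the length (in words) of a few stored documents.
for doc in islice(document_store.get_all_documents_generator(), 5):
    print(doc.meta.get("name"), len(doc.content.split()), "words")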

I hope these suggestions can help you nail down the bottleneck. Consider, though, that without a GPU, DPR will always take 1-2 minutes to reply. For faster queries with your setup, I recommend the ElasticsearchRetriever.
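
If you want to give that a try, the change is small (sketch; with this retriever you also don't need the update_embeddings step):

from haystack.nodes import ElasticsearchRetriever

# BM25-based sparse retrieval: no embedding computation at query time,
# so it is much faster on CPU than DPR.
retriever = ElasticsearchRetriever(document_store=document_store)
qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)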

HighDeFing commented 2 years ago

Okay, I started using Ubuntu as a dual boot instead of a virtual machine so I can take advantage of all my CPU cores. With 19 workers I managed to get it down to 10 minutes with my initial models, instead of 15 minutes.

With the changes you proposed to the models (and also running on the full hardware instead of a VM), I managed to get the indexing time for 185 documents down from 1.5 hours to 1 hour, and the query time from 10 minutes to 5 minutes.

> Another thing you could try is to check how long the documents in the document store are. Use document_store.get_all_documents() with a filter to retrieve just a few, and make sure they contain only a few sentences each.

My documents have around 30,000 words each; they are academic theses and dissertations, and now that you mention it, I think this is the core of the problem. Is there any way to split them into smaller passages so I get better query times and also better accuracy?

Also, about the GPU: it seems Haystack runs on an older version of PyTorch (1.10.2) and my GPU is too new. I get this error when setting use_gpu=True:

NVIDIA GeForce RTX 3070 Ti with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.

It eventually fails with RuntimeError: CUDA error: no kernel image is available for execution on the device and gives me no answers.
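
A quick way to see the mismatch is something like this (my card's compute capability is sm_86, which is missing from the architectures that PyTorch build supports):

import torch

print(torch.__version__, torch.version.cuda)   # installed PyTorch version / CUDA build
print(torch.cuda.get_device_name(0))           # e.g. NVIDIA GeForce RTX 3070 Ti
print(torch.cuda.get_device_capability(0))     # (8, 6) -> sm_86
print(torch.cuda.get_arch_list())              # architectures this build was compiled for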

I can stick with the CPU, that's fine; I just need help with splitting the documents, since that looks to be the core problem. Thank you very much.

For my full project I intended to use around 8,000 documents with 25,000-40,000 words each.

Updating my system specs now, since I'm no longer running it in a VM:

System

HighDeFing commented 2 years ago

I applied the processor function with these parameters, before indexing the documents in ES, to get smaller passages:

self.processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="passage",
    split_length=200,
    split_respect_sentence_boundary=False,
    split_overlap=0,
    language="es"
)
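
Roughly, the processor now sits between the converter and the document store in the indexing step (simplified sketch of the flow):

docs = converter.convert(file_path=file_source, meta=meta_data)
docs = self.processor.process(docs)          # split each thesis into smaller passages
self.document_store.write_documents(docs)
self.document_store.update_embeddings(self.retriever)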

Now query time with 185 theses (1427 documents in total in ES) is around 1:30 minutes. Looks a lot better now.

Embedding took more than 5 hours, though; I went to work and it was only finished when I came back. 🤣

ZanSara commented 2 years ago

Hey @HighDeFing, great to hear that the query time looks reasonable now! :tada: 1-2 minutes is well within what I consider normal for a pipeline like yours, on your hardware. Unfortunately the same is true for the indexing part: with the amount of documents you have, it's no surprise that it takes several hours.

If you manage to get the GPU going, your query time could go down to seconds and the indexing should also speed up considerably. Although Haystack requires PyTorch 1.10, you can try to upgrade it anyway and retry: we keep that version only because we haven't tested the new releases yet, but there's a good chance it will all work with a newer version too.

Let us know if it is the case! If so, that would be a nice push for us to support newer PyTorch versions too :slightly_smiling_face:

TuanaCelik commented 2 years ago

@HighDeFing great to see that the preprocessor step helped you so much. I had a problem with a demo this week which ended up being solved exactly like this too! I think we both ended up learning about the same problem with large documents during the same week 😄

HighDeFing commented 2 years ago

> Let us know if it is the case! If so, that would be a nice push for us to support newer PyTorch versions too :slightly_smiling_face:

Okay, I had to install PyTorch 1.11.0 with CUDA 11.3 (pip3 install --upgrade torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113) to work with my RTX 3070 Ti.

But the queries work fine and everything seems compatible: from 1.5 minutes to 5-10 seconds!! With 185 theses (1427 documents in total in ES) 😲

Let's see how the embedding time does this time as well. I'll keep you updated.