Hi @HighDeFing - thank you for sending these over. I am going to have a look and try to help you out here. First off, thank you for the code snippets above, but could you send over a fully reproducible example if you can? Maybe as a single file, and if possible a link to the documents you're writing to the docstore, or some specs about them?
Yes, this is my project repository: https://github.com/HighDeFing/thesis_v4

To run:

```
python3 -m venv env
source env/bin/activate
env/bin/python3 setup.py install
pip install -e .
pip install -r requirements.txt
pip install Unidecode
```

(Some requirements might be wrong, apologies.)

All files are in thesis_pdfs. As of now, I think you can run this code with:

```
docker network create elastic
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.13.0
docker run --name es01-test --net elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.13.0
docker start es01-test -a
```

Then:

```
(env) /thesis_v4$ env/bin/python3 haystack_upload_files.py
```
Right now it works with a predetermined query. To run:

```
uvicorn main:app --reload
```

(You might need to install uvicorn.) Then open http://127.0.0.1:8000/search.html, click "buscar" (search), and it should give you a response.

Hope this helps; if you have any questions, please feel free to ask.
Hey @HighDeFing, sorry but that's way too much code for us to parse. What @TuanaCelik was asking was a Minimal Reproducible Example: please read this great description of it by the StackOverflow people to understand a bit better what it is exactly https://stackoverflow.com/help/minimal-reproducible-example
We understand that it takes time to put together such an example, but the advantage is that in the process of making it, many other issues might come up and solve themselves. Minimal reproducible examples are a great debugging tool in their own right :slightly_smiling_face:
Unfortunately we can't really help you further otherwise. Judging by your machine specs, you should be able to get answers within 1-2 minutes maximum. The only thing that might be taking so long is if you're re-indexing all your documents at every query, or some other similar bug, but without a simple, single-file example of the issue it's impossible to tell. I hope you understand.
@ZanSara Oh okay, I will get back to you as soon as I can then. Thank you!
@HighDeFing In the meantime we've released a new version of Haystack (1.3.0). So you could try upgrading your Haystack package and see if this helps.
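For example, assuming you installed it from PyPI (where the package is named farm-haystack): `pip install --upgrade farm-haystack`.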
I made a reproducible example using version 1.3.0 of Haystack; it's a more compact version of my code: https://github.com/HighDeFing/repro_haystack. Every instruction for how to run it is in the readme.md.

I still encounter the same issue: queries take too long. Indexing the documents took me about 1.5 hours. I use Elasticsearch version 7.13.0 in a Docker container.

Please tell me if I should use fewer files for the example, and ask me any questions. I really appreciate the help, thank you.
Thanks, I'll look into it.
First off I noticed that you're using a rather large model. Have you tried using https://huggingface.co/mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es instead? Does the query time go down, and by how much?
You could also try to use "voidful/dpr-ctx_encoder-bert-base-multilingual" as your passage embedding model.
You also don't seem to reuse the pipeline in your test. Have you made sure that in your bigger project the pipeline is reused?
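For illustration, a minimal single-init sketch (assuming Haystack 1.3's API and the models suggested above; the query string is just an example): build everything once at startup, then reuse the pipeline for every query.

```python
# A sketch, not your exact code: construct the store, retriever, reader and
# pipeline ONCE, so each query only runs retrieval + reading (no re-indexing).
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import DensePassageRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

document_store = ElasticsearchDocumentStore(host="localhost", index="document")
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="voidful/dpr-question_encoder-bert-base-multilingual",
    passage_embedding_model="voidful/dpr-ctx_encoder-bert-base-multilingual",
)
reader = FARMReader("mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
qa_pipe = ExtractiveQAPipeline(reader, retriever)

# Reuse qa_pipe for every incoming query:
result = qa_pipe.run(
    query="¿Qué metodología usa la tesis?",  # example query
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)
```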
Another thing you could try is to check how long the documents in the document store are. Use document_store.get_all_documents() with a filter to retrieve just a few, and make sure they contain only a few sentences each.
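For example (a sketch, assuming Haystack 1.3's API and the document_store object from the snippet above; the filter value is hypothetical):

```python
# Spot-check passage lengths: overly long passages are a common cause of
# slow retrieval and poor reader accuracy.
docs = document_store.get_all_documents(filters={"name": ["some_thesis.pdf"]})
for doc in docs[:5]:
    print(len(doc.content.split()), "words |", doc.content[:80], "...")
```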
I hope these suggestions can help you nail down the bottleneck. Consider, though, that without a GPU, DPR will always take 1-2 minutes to reply. For faster queries on your setup, I recommend ElasticsearchRetriever.
Okay, I started using Ubuntu as a dual boot instead of a virtual machine so I can take advantage of all my CPU cores. With 19 workers, I managed to get it down to 10 minutes with my initial models, instead of 15 minutes.
With the model changes you proposed (and also running on the full hardware instead of a VM), the indexing time for 185 documents went from 1.5 hours to 1 hour, and the query time from 10 minutes to 5 minutes.
> Another thing you could try is to check how long the documents in the document store are. Use document_store.get_all_documents() with a filter to retrieve just a few, and make sure they contain only a few sentences each.
My documents have around 30,000 words each (they are academic theses and dissertations), and now that you mention it, I think this is the core of the problem. Is there any way to split them into smaller passages so I get better query times and also better accuracy?
Also, about the GPU: it seems Haystack runs on an older version of PyTorch (1.10.2) and my GPU is too new. I get this error when setting use_gpu=True:

```
NVIDIA GeForce RTX 3070 Ti with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
```

It eventually fails with RuntimeError: CUDA error: no kernel image is available for execution on the device, and gives me no answers.
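For reference, a quick way to check this mismatch from Python (plain PyTorch calls, nothing Haystack-specific):

```python
import torch

# Compare the compute capabilities the installed wheel was built for
# against the GPU actually present.
print(torch.__version__, "built for CUDA", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
print("Supported archs:", torch.cuda.get_arch_list())  # sm_86 must be listed
```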
Sticking with the CPU is fine; I just need help with splitting the documents, since that looks to be the core problem. Thank you very much.

For my full project I intended to use around 8,000 documents with 25,000-40,000 words each.
Updating my system info, since I'm no longer running it in a VM:

- Haystack: 1.3.0
- Elasticsearch: 7.13.0
- query_embedding_model="voidful/dpr-question_encoder-bert-base-multilingual"
- passage_embedding_model="voidful/dpr-ctx_encoder-bert-base-multilingual"

I applied the processor function with these parameters to get smaller passages:
```python
self.processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="passage",
    split_length=200,
    split_respect_sentence_boundary=False,
    split_overlap=0,
    language="es"
)
```
This runs before indexing the documents into ES. Query time with 185 theses (1,427 documents in total in ES) is now about 1:30 minutes. Looks a lot better now.
Embedding took more than 5 hours though; I went to work and it had only just finished when I came back. 🤣
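For what it's worth, a variant worth trying (a sketch, not the thread's actual config): splitting by word count instead of by passage keeps each chunk closer to DPR's usual passage size.

```python
from haystack.nodes import PreProcessor

# Alternative settings (an assumption, not the config above): word-based
# splits of ~100 words stay well within DPR's 256-token input window.
processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
    split_overlap=10,   # small overlap so answers aren't cut at boundaries
    language="es",
)
```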
Hey @HighDeFing, great to hear that the query time looks reasonable now! :tada: 1-2 minutes is well within what I consider normal for a pipeline like yours on your hardware. Unfortunately the same is true for the indexing part: with the amount of documents you have, it's no surprise that it takes several hours.
If you manage to get the GPU going, your query time could go down to seconds and the indexing should also speed up considerably. Although Haystack requires PyTorch 1.10, you can try to upgrade it anyway and retry: we pin that version only because we haven't tested the newer releases yet, but there's a good chance it will all work with a newer version too.
Let us know if it is the case! If so, that would be a nice push for us to support newer PyTorch versions too :slightly_smiling_face:
@HighDeFing great to see that the preprocessor step helped you so much. I had a problem with a demo this week which ended up being solved exactly like this too! I think we both ended up learning about the same problem with large documents during the same week 😄
> Let us know if it is the case! If so, that would be a nice push for us to support newer PyTorch versions too :slightly_smiling_face:
Okay, I had to install PyTorch 1.11.0 with CUDA 11.3 to work with my RTX 3070 Ti:

```
pip3 install --upgrade torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
```

But the queries work fine and everything seems compatible: from 1.5 minutes down to 5-10 seconds, with 185 theses (1,427 documents in total in ES)!! 😲
Let's see how the embedding time does this time as well. I'll keep you updated.
This is a reference to an issue I raised yesterday, to which I got a response. Originally posted by @ZanSara in https://github.com/deepset-ai/haystack/issues/2060#issuecomment-1072164793.
Behavior:
When doing extractive QA with dense passage retrieval, it takes up to 15 minutes to get an answer. I expected it to take only a few seconds or a couple of minutes.
Context
I'm using a Python class to wrap Haystack (its initialization is described here), using dense passage retrieval in Spanish. I'm using Elasticsearch as the document store; this is the way I write the files into the document store:
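(The snippet itself isn't shown above; a minimal sketch of such an indexing step, assuming Haystack 1.3's API and reusing the document_store, retriever, and processor names from the sketches earlier in the thread:)

```python
# Hypothetical reconstruction of the indexing step: split the extracted text
# into passages, write them to Elasticsearch, then compute DPR embeddings.
docs = processor.process(raw_docs)           # raw_docs: text extracted from the PDFs
document_store.write_documents(docs)
document_store.update_embeddings(retriever)  # the slow step: one DPR pass per passage
```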
Query
To get the answer I'm doing this inside the class: just call the qa_pipe from the class and do a query like this:
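(Again the snippet isn't shown; a minimal sketch of such a call, assuming the pipeline attribute is named qa_pipe as above and using an example query:)

```python
# Hypothetical example query against the pipeline built in the class.
prediction = qa_pipe.run(
    query="¿Cuál es el objetivo de esta tesis?",  # example query, not from the issue
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)
for answer in prediction["answers"]:
    print(answer.answer, answer.score)
```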
System
I'm using a virtual machine with host (Windows 11) and guest Ubuntu 20.04. The specs for the guest are:
- Reader: FARMReader("mrm8488/bert-base-spanish-wwm-cased-finetuned-spa-squad2-es")
- query_embedding_model="voidful/dpr-question_encoder-bert-base-multilingual"
- passage_embedding_model="sadakmed/dpr-passage_encoder-spanish"

Any help is extremely welcome. Thank you.