Closed: nsankar closed this issue 4 years ago
Hey @nsankar,
ScaNN is definitely on our radar. The benchmarks look impressive. The only doubts I currently have are about stability and user-friendliness. From a first look into the codebase, ScaNN seems to target research rather than production environments, and the installation seems a bit heavier, as it requires TensorFlow plus compilation from source on some OSes. Have you already tested it yourself?
When we discussed the next DocumentStores for large-scale vector similarity in Haystack, we thought it would be good to have:
As a way forward, I would suggest:
What do you think?
@tholor Got it. As you nailed it, at present I believe FAISS is the way to go, followed by Milvus. About Jina, I am not sure; I personally feel it is a bit more complicated than Milvus in the way it is structured.
@tholor Once the FAISS implementation is completed (before a release), please let me know in this thread with the basic instructions. I can test it and share any observations or feedback.
Hi @nsankar, thank you for the offer. Any feedback/observations would be highly appreciated.
The FAISS implementation was added by #253.
You can try it out with Tutorial 6 (Tutorial6_Better_Retrieval_via_DPR.py). The current tutorial uses an ElasticsearchDocumentStore for storing the embeddings. You can replace it with FAISSDocumentStore here:

```python
from haystack.database.faiss import FAISSDocumentStore

document_store = FAISSDocumentStore()
```
We'll update the tutorials in the next few days. Happy to help if you face any issues along the way.
@tanaysoni OK, will try it, thanks. One thing from my viewpoint: there should probably be a way to connect to a remote FAISS datastore (just like having a separate Elasticsearch server host), so that Haystack with FAISS can be deployed scalably and performantly on distributed cloud infrastructure, say on suitable Amazon server instances. This might be done using a FAISS gRPC server such as https://github.com/louiezzang/faiss-server, or using Milvus (https://github.com/milvus-io/pymilvus#install-pymilvus). The Milvus documentation site (https://milvus.io/docs/overview.md) indicates that it integrates with FAISS, but I don't see the details. What do you think?
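To make the idea concrete, a remote vector store would only need to expose a small add/search contract to Haystack. Below is a purely illustrative, stdlib-only sketch of that contract; all names are hypothetical, and the in-process class is a stand-in for a real gRPC/Milvus client, not an actual implementation:

```python
# Hypothetical sketch of the minimal contract a remote vector index
# (e.g. a FAISS gRPC server or Milvus) would need to expose.
# InMemoryVectorIndex is an in-process stand-in, NOT a real remote client.

class InMemoryVectorIndex:
    """Stand-in for a remote index; stores vectors in a dict."""

    def __init__(self):
        self._vectors = {}  # doc_id -> list[float]

    def add(self, doc_id, vector):
        self._vectors[doc_id] = vector

    def search(self, query, top_k=1):
        # Rank stored vectors by dot-product similarity to the query.
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))

        ranked = sorted(self._vectors.items(),
                        key=lambda kv: dot(query, kv[1]),
                        reverse=True)
        return [doc_id for doc_id, _ in ranked[:top_k]]


index = InMemoryVectorIndex()
index.add("doc1", [1.0, 0.0])
index.add("doc2", [0.0, 1.0])
print(index.search([0.9, 0.1], top_k=1))  # → ['doc1']
```

A real remote store would implement the same two calls over the network; the rest of the Haystack pipeline would not need to know the difference.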
@tanaysoni There seem to be some issues.
I had removed faiss-cpu and installed faiss-gpu. I had installed sqlite3 and created a DB named newdb. From a simple PDF file containing just 3 paragraphs of text, I extracted the text and created a list of 11 dictionaries called dict_list, which I had also used for testing earlier.
Next, I created the FAISS document store, passing the SQLite DB string as follows, and then called write_documents. This worked without any issues.
```python
from haystack.database.faiss import FAISSDocumentStore

document_store = FAISSDocumentStore(sql_url="sqlite:///newdb.db")
document_store.write_documents(dict_list)
```
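For reference, `write_documents` takes a list of dicts, each with a `"text"` field and an optional `"meta"` dict. A small sketch of how a `dict_list` like the one above might be built from extracted paragraphs (the paragraph strings and meta keys here are made up for illustration):

```python
# Illustrative input format for write_documents: a list of dicts,
# each with a "text" field and an optional "meta" dict.
paragraphs = [
    "First paragraph extracted from the PDF.",
    "Second paragraph extracted from the PDF.",
    "Third paragraph extracted from the PDF.",
]

dict_list = [
    {"text": para, "meta": {"name": "sample.pdf", "paragraph": i}}
    for i, para in enumerate(paragraphs)
]
print(len(dict_list))  # → 3
```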
Next, the following code block was executed in a Colab cell:
```python
from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(document_store=document_store,
                                  embedding_model="dpr-bert-base-nq",
                                  do_lower_case=True, use_gpu=True)
document_store.update_embeddings(retriever)
```
This keeps running for a long time, more than 10 minutes, so I interrupted the run. I have tried this a few times with the same result. There seems to be an issue.
The output of this block, up to the point where I interrupted the Colab cell, is below:
```
08/11/2020 10:24:05 - INFO - haystack.retriever.dpr_utils - Loading saved model from models/dpr/checkpoint/retriever/single/nq/bert-base-encoder.cp
08/11/2020 10:24:05 - INFO - haystack.retriever.dense - Loaded encoder params: {'do_lower_case': True, 'pretrained_model_cfg': 'bert-base-uncased', 'encoder_model_type': 'hf_bert', 'pretrained_file': None, 'projection_dim': 0, 'sequence_length': 256}
08/11/2020 10:24:15 - INFO - haystack.retriever.dense - Loading saved model state ...
08/11/2020 10:24:15 - INFO - haystack.retriever.dense - Loading saved model state ...
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-46-f4d958d754e6> in <module>()
     12 # At query time, we only need to embed the query and compare it the existing doc embeddings which is very fast.
     13
---> 14 document_store.update_embeddings(retriever)
     15
     16 # ES retreivar

14 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
   1610         ret = torch.addmm(bias, input, weight.t())
   1611     else:
-> 1612         output = input.matmul(weight.t())
   1613     if bias is not None:
   1614         output += bias

KeyboardInterrupt:
```
Another observation: when I didn't pass any SQLite DB URL string to the FAISS document store, it didn't complain, and the above block likewise kept running for a long period.
Hi @nsankar, it could be that it's taking time due to a large number of documents. To rule that out, can you try indexing fewer documents, i.e., `document_store.write_documents(dict_list[:10])`? Ensure that the existing database `newdb.db` is deleted before continuing.
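Deleting the stale database can be done with a couple of lines of stdlib Python in the Colab cell before re-indexing (the filename matches the one used above):

```python
import os

# Remove the SQLite file from a previous run so indexing starts clean.
if os.path.exists("newdb.db"):
    os.remove("newdb.db")
```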
> when I didn't input any sqllite db url string to the faiss document store, it didn't complain
When the `sql_url` parameter is unspecified, a transient in-memory SQLite database is created by default.
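The transience matters: an in-memory SQLite database lives only as long as its connection, so nothing written there survives a restart. A quick stdlib illustration of that behavior:

```python
import sqlite3

# ":memory:" creates a database that exists only for this connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, text TEXT)")
conn.execute("INSERT INTO document (text) VALUES ('hello')")
count = conn.execute("SELECT COUNT(*) FROM document").fetchone()[0]
print(count)  # → 1

# A second connection to ":memory:" gets a fresh, empty database.
other = sqlite3.connect(":memory:")
tables = other.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)  # → []
```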
@tanaysoni As I mentioned, the document set was very small: len(dict_list) was 11, i.e., just 11 text entries. I will check it out again.
@tanaysoni It works now after a fresh install of Haystack. I used the FARMReader. It's the same thing that I did earlier.
In Colab, when I tried using `txReader = TransformersReader()`, I get the following CUDA driver error. I am going to try upgrading to PyTorch 1.6. Do you have any suggestions? Thanks.
```
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-18-72e07f951b35> in <module>()
----> 9 txReader = TransformersReader()

/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py in _check_driver()
     61                 Alternatively, go to: https://pytorch.org to install
     62                 a PyTorch version that has been compiled with your version
---> 63                 of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
     64
     65

AssertionError:
The NVIDIA driver on your system is too old (found version 10010).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.
```
Hi @nsankar, installing PyTorch manually in Colab with `!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html` should resolve this issue.
The tutorials are also now updated with #322.
I'm closing this thread, but please feel free to update if you still face this issue.
**Is your feature request related to a problem? Please describe.**
Feature enhancement in Haystack for an efficient way to work with large document embeddings.

**Describe the solution you'd like**
Google's newly open-sourced ScaNN is more accurate and performs better than FAISS, so it may be worthwhile to integrate it with Haystack.

**Describe alternatives you've considered**
ScaNN vis-à-vis FAISS.

**Additional context**
https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html
https://github.com/google-research/google-research/tree/master/scann
ScaNN can be configured to fit datasets with different sizes and distributions. It has both TensorFlow and Python APIs. The library shows strong performance with large datasets.
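For context on what ScaNN (like FAISS) accelerates: exact maximum-inner-product search compares the query against every stored vector, which approximate methods avoid at scale. A stdlib-only sketch of the exact baseline, with made-up toy vectors:

```python
# Exact (brute-force) maximum-inner-product search: the O(n * d)
# baseline that ANN libraries like ScaNN and FAISS approximate,
# trading a little accuracy for much lower latency on large datasets.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_search(query, dataset, top_k=2):
    scores = [(i, dot(query, vec)) for i, vec in enumerate(dataset)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scores[:top_k]]

dataset = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(exact_search([1.0, 0.1], dataset, top_k=2))  # → [0, 2]
```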