deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Integrate Google ScaNN for efficiently working with large size embeddings #281

Closed nsankar closed 4 years ago

nsankar commented 4 years ago

Is your feature request related to a problem? Please describe.
A feature enhancement in Haystack: an efficient way to work with large document embeddings.

Describe the solution you'd like
Google's newly open-sourced ScaNN is more accurate and performs better than FAISS, so it may be worthwhile to integrate it with Haystack.

Describe alternatives you've considered
ScaNN vis-à-vis FAISS.

Additional context
https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html
https://github.com/google-research/google-research/tree/master/scann
ScaNN can be configured to fit datasets of different sizes and distributions. It has both TensorFlow and Python APIs, and the library shows strong performance on large datasets.
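
For reference, ScaNN's Python API at the time looked roughly like this (a sketch based on the project's README; the toy dataset and the tree/quantization parameters are illustrative, not tuned):

import numpy as np
import scann

# Toy corpus: 100k random 128-dim vectors, L2-normalized for dot-product search.
dataset = np.random.rand(100000, 128).astype(np.float32)
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)

# Build a partitioned, asymmetric-hashing searcher with exact re-ranking of the top 100.
searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=25000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

# Query a single vector; returns neighbor ids and scores.
neighbors, distances = searcher.search(dataset[0], final_num_neighbors=5)
print(neighbors, distances)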


tholor commented 4 years ago

Hey @nsankar,

ScaNN is definitely on our radar, and the benchmarks look impressive. My only doubts at the moment concern stability and user-friendliness. From a first look at the codebase, ScaNN seems to target research rather than production environments, and the installation seems a bit heavier, as it requires TensorFlow plus compilation from source on some operating systems. Have you already tested it yourself?

When we discussed the next DocumentStores for large-scale vector similarity in Haystack, we thought it would be good to have:

As a way forward, I would suggest:

What do you think?

nsankar commented 4 years ago

@tholor Got it. As you nailed it, at present I believe FAISS is the way to go, followed by Milvus. About Jina, I am not sure; I personally feel it is a bit more complicated than Milvus in the way it is structured.

nsankar commented 4 years ago

@tholor Once the FAISS implementation is completed (before a release), please let me know in this thread with basic instructions. I can test it and share any observations or feedback.

tanaysoni commented 4 years ago

Hi @nsankar, thank you for the offer. Any feedback/observations would be highly appreciated.

The FAISS implementation was added by #253.

You can try it out with Tutorial 6 (Tutorial6_Better_Retrieval_via_DPR.py). The current tutorial uses an ElasticsearchDocumentStore for storing the embeddings. You can replace it with a FAISSDocumentStore here:

from haystack.database.faiss import FAISSDocumentStore
document_store = FAISSDocumentStore()
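
For context, the resulting flow looks roughly like this (a sketch against the Haystack 0.x API used in this thread; the example documents and the final retrieve() call are illustrative):

from haystack.database.faiss import FAISSDocumentStore
from haystack.retriever.dense import DensePassageRetriever

# Uses a transient in-memory SQLite DB when no sql_url is given.
document_store = FAISSDocumentStore()

# Each document dict needs at least a "text" field.
document_store.write_documents([
    {"text": "Berlin is the capital of Germany."},
    {"text": "Paris is the capital of France."},
])

# Embed the stored documents with DPR and write the vectors into the FAISS index.
retriever = DensePassageRetriever(document_store=document_store,
                                  embedding_model="dpr-bert-base-nq",
                                  do_lower_case=True, use_gpu=True)
document_store.update_embeddings(retriever)

# Dense retrieval against the FAISS index.
print(retriever.retrieve(query="What is the capital of Germany?", top_k=1))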

We'll update the tutorials in the next days. Happy to help if you face any issues along the way.

nsankar commented 4 years ago

@tanaysoni OK, will try it. Thanks. One thing from my viewpoint: there should probably be a way to connect to a remote FAISS datastore (just like having a separate Elasticsearch server host), so that Haystack with FAISS can be deployed scalably and performantly on distributed cloud infrastructure with suitable server instances, say on Amazon. This may be done with a FAISS gRPC server such as https://github.com/louiezzang/faiss-server, or by using Milvus (https://github.com/milvus-io/pymilvus#install-pymilvus). The Milvus documentation site indicates that it integrates with FAISS (https://milvus.io/docs/overview.md), but I don't see the details. What do you think?
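
To illustrate the remote-FAISS idea, here is a minimal hypothetical sketch (not part of Haystack, and independent of the faiss-server project linked above): a thin HTTP wrapper around a FAISS index, so that clients on other machines can add and query vectors over the network. The endpoint names and payloads are made up for illustration.

import faiss
import numpy as np
from flask import Flask, jsonify, request

DIM = 768  # DPR embeddings are 768-dimensional
app = Flask(__name__)
index = faiss.IndexFlatIP(DIM)  # exact inner-product search; swap in an ANN index at scale

@app.route("/vectors", methods=["POST"])
def add_vectors():
    # Expects {"vectors": [[...], ...]} with DIM-dimensional float vectors.
    vecs = np.asarray(request.json["vectors"], dtype="float32")
    index.add(vecs)
    return jsonify({"ntotal": int(index.ntotal)})

@app.route("/search", methods=["POST"])
def search():
    # Expects {"queries": [[...], ...], "top_k": 10}; returns scores and vector ids.
    queries = np.asarray(request.json["queries"], dtype="float32")
    top_k = int(request.json.get("top_k", 10))
    scores, ids = index.search(queries, top_k)
    return jsonify({"scores": scores.tolist(), "ids": ids.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)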

nsankar commented 4 years ago

@tanaysoni There seem to be some issues.

I removed faiss-cpu and installed faiss-gpu. I installed sqlite3 and created a DB named newdb. From a simple PDF file containing just 3 paragraphs of text, I extracted the text and created a list of 11 dictionaries called dict_list, which I had also used for testing earlier.

Next, I created the FAISS document store, passing the SQLite DB string as follows, and then called write_documents. This worked without any issues.

from haystack.database.faiss import FAISSDocumentStore
document_store = FAISSDocumentStore(sql_url="sqlite:///newdb.db")
document_store.write_documents(dict_list)

Next, the following code block was executed in the Colab cell:

from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(document_store=document_store, embedding_model="dpr-bert-base-nq", do_lower_case=True, use_gpu=True)
document_store.update_embeddings(retriever)

This keeps running for a long time, more than 10 minutes, so I interrupted the run. I have tried this a few times with the same observation. There seems to be an issue.

The output of this block until I interrupted the Colab cell is as follows:

08/11/2020 10:24:05 - INFO - haystack.retriever.dpr_utils -   Loading saved model from models/dpr/checkpoint/retriever/single/nq/bert-base-encoder.cp
08/11/2020 10:24:05 - INFO - haystack.retriever.dense -   Loaded encoder params:  {'do_lower_case': True, 'pretrained_model_cfg': 'bert-base-uncased', 'encoder_model_type': 'hf_bert', 'pretrained_file': None, 'projection_dim': 0, 'sequence_length': 256}
08/11/2020 10:24:15 - INFO - haystack.retriever.dense -   Loading saved model state ...
08/11/2020 10:24:15 - INFO - haystack.retriever.dense -   Loading saved model state ...
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-46-f4d958d754e6> in <module>()
     12 # At query time, we only need to embed the query and compare it the existing doc embeddings which is very fast.
     13 
---> 14 document_store.update_embeddings(retriever)
     15 
     16 # ES retreivar

14 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
   1610         ret = torch.addmm(bias, input, weight.t())
   1611     else:
-> 1612         output = input.matmul(weight.t())
   1613         if bias is not None:
   1614             output += bias

KeyboardInterrupt: 

Another observation: when I didn't pass any SQLite DB URL string to the FAISS document store, it didn't complain, and the above block kept running for a long period; the same observation as above.

tanaysoni commented 4 years ago

Hi @nsankar, it's possible that it's taking time due to a large number of documents. To rule that out, can you try indexing fewer documents, i.e., document_store.write_documents(dict_list[:10])? Ensure that the existing database newdb.db is deleted before retrying.


when I didn't pass any SQLite DB URL string to the FAISS document store, it didn't complain

When the sql_url parameter is unspecified, a transient in-memory SQLite database is created by default.
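
For persistence across sessions, a SQLAlchemy-style connection string can be passed instead (a sketch; the file path and server URL below are placeholders, and a server-backed database assumes the corresponding driver is installed):

from haystack.database.faiss import FAISSDocumentStore

# File-backed SQLite: document metadata survives process restarts.
document_store = FAISSDocumentStore(sql_url="sqlite:///newdb.db")

# Hypothetical server-backed alternative via a SQLAlchemy URL, e.g. PostgreSQL:
# document_store = FAISSDocumentStore(sql_url="postgresql://user:password@dbhost:5432/haystack")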

nsankar commented 4 years ago

@tanaysoni As I mentioned, the document set was very small: len(dict_list) was 11, just 11 text entries. I will check it out again.

nsankar commented 4 years ago

@tanaysoni It works now after a fresh install of Haystack. I used the FARMReader; it's the same setup that I used earlier.

In Colab, when I tried using txReader = TransformersReader(), I got the following CUDA driver error. I am going to try upgrading to PyTorch 1.6. Do you have any suggestions? Thanks.

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-18-72e07f951b35> in <module>()

----> 9 txReader = TransformersReader()
/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py in _check_driver()
     61 Alternatively, go to: https://pytorch.org to install
     62 a PyTorch version that has been compiled with your version
---> 63 of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
     64 
     65 

AssertionError: 
The NVIDIA driver on your system is too old (found version 10010).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.

tanaysoni commented 4 years ago

Hi @nsankar, installing PyTorch manually with !pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html in Colab should resolve this issue.
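
A quick way to verify that the installed build matches the Colab driver (a minimal check):

import torch

# Print the PyTorch version, the CUDA toolkit it was built against, and whether the driver is usable.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())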

The tutorials are also now updated with #322.

I'm closing this thread, but please feel free to update if you still face this issue.