castorini / ura-projects


DPR inside the web browser? #7

Open lintool opened 10 months ago

lintool commented 10 months ago

With https://www.tensorflow.org/js we can potentially run DPR in the web browser?

We've done stuff like this before... https://arxiv.org/abs/1810.12859

lintool commented 10 months ago

Just came across this: https://www.leebutterman.com/2023/06/01/offline-realtime-embedding-search.html

AlexStan0 commented 5 months ago

Is this taken? If not, I'd be interested in working on it.

lintool commented 4 months ago

Also: https://huggingface.co/docs/transformers.js/index

Transformers.js uses ONNX Runtime to run models in the browser. The best part about it, is that you can easily convert your pretrained PyTorch, TensorFlow, or JAX models to ONNX using 🤗 Optimum.

We already use ONNX in Anserini... e.g., https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-msmarco-passage-bge-base-en-v1.5-hnsw-int8-onnx.md

AlexStan0 commented 2 months ago

Got it working, although I did not use transformers.js; instead I used its Python counterpart with a Flask server and the DPR gold passages nq_dev.json data set.

You will need to clone the dpr in browser repository and set up a conda Python 3.8.18 environment for this to work.

The conda environment can be set up as follows:

conda create --name <environment name> python=3.8.18
conda activate <environment name>

After doing so, run the data.ipynb notebook, which prepares the DPR data set. Make sure the conda Python 3.8.18 environment is selected as the kernel; you may be asked to install ipykernel.

After all that, you can run pip install -r requirements.txt to install the requirements. Once that is done, you can start the server. It does not load the entire data set; instead it loads 10 batches of size 100. This should take around 10 minutes, but YMMV depending on your system resources.
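
For anyone curious what "10 batches of size 100" amounts to, here is a minimal sketch of capped batch loading. It is not taken from the repository: the function name `limited_batches` and the stand-in `passages` generator are hypothetical, and the real server may load its data differently.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def limited_batches(
    items: Iterable[T], batch_size: int = 100, max_batches: int = 10
) -> Iterator[List[T]]:
    """Yield at most `max_batches` lists of up to `batch_size` items each."""
    iterator = iter(items)
    for _ in range(max_batches):
        batch = list(islice(iterator, batch_size))
        if not batch:
            # The source ran out before the batch cap was reached.
            return
        yield batch


# Hypothetical stand-in for the prepared passages (assumption, not the real data).
passages = (f"passage {i}" for i in range(5000))

batches = list(limited_batches(passages))
total_loaded = sum(len(batch) for batch in batches)
print(len(batches), total_loaded)
```

With the defaults this stops after 1,000 items however large the source is, which keeps start-up time bounded at the cost of searching only a slice of the collection.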

Once that is done, you should be able to open up the website and type in queries. The page shows the top 5 passages with the highest MRR@100 for the query.
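
A note on terminology, with a small sketch. DPR scores a passage by the inner product between the query embedding and the passage embedding, while MRR@100 is usually an evaluation metric: the reciprocal rank of the first relevant passage within the top 100, averaged over queries. The code below assumes that reading and may not match what the repository computes. The helper names (`top_k`, `reciprocal_rank`) and the toy two-dimensional vectors are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest inner products."""
    scored = [(i, dot(query_vec, vec)) for i, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def reciprocal_rank(
    ranked_ids: Sequence[int], relevant_ids: Set[int], cutoff: int = 100
) -> float:
    """1 / rank of the first relevant passage within `cutoff`, else 0.0."""
    for rank, passage_id in enumerate(ranked_ids[:cutoff], start=1):
        if passage_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Toy embeddings (assumption: real DPR vectors have 768 dimensions).
query = [1.0, 0.0]
passage_embeddings = [[0.2, 0.9], [0.9, 0.1], [0.5, 0.5]]

ranking = top_k(query, passage_embeddings, k=3)
ranked_ids = [passage_id for passage_id, _ in ranking]
rr = reciprocal_rank(ranked_ids, relevant_ids={2})
print(ranked_ids, rr)
```

Averaging `reciprocal_rank` over a set of queries with known gold passages gives MRR@100 for the whole system; the per-query list shown on the page would come from `top_k` with k=5.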