abhinavkulkarni opened this issue 2 years ago
In BLINK's biencoder, `score_candidates` computes a dot product between two matrices, but the native `IndexHNSWFlat` only supports L2 distance.
BLINK's faiss wrapper addresses this by transforming the dot-product space into an L2 space: it adds an extra dimension to each vector and applies a mathematical transformation so that the L2-nearest neighbors in the augmented space are exactly the maximum-dot-product candidates in the original space.
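For concreteness, here is a sketch of that transformation (the standard MIPS-to-L2 reduction; this is an illustration, not BLINK's actual code). Each candidate `x` gets an extra coordinate `sqrt(M^2 - ||x||^2)`, where `M` is the maximum candidate norm, and each query `q` gets a 0 appended. Then `||q' - x'||^2 = ||q||^2 + M^2 - 2<q, x>`, so minimizing L2 distance over the augmented candidates maximizes the dot product:

```python
import numpy as np

def augment_candidates(xs):
    # xs: float32 array of shape (n, d)
    sq_norms = (xs ** 2).sum(axis=1)
    max_sq_norm = sq_norms.max()
    # extra coordinate sqrt(M^2 - ||x||^2); clip guards against rounding error
    aux = np.sqrt(np.clip(max_sq_norm - sq_norms, 0, None))
    return np.hstack([xs, aux[:, None]]).astype("float32")

def augment_queries(qs):
    # queries get a zero in the extra dimension
    return np.hstack([qs, np.zeros((qs.shape[0], 1))]).astype("float32")
```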
However, BLINK's source code doesn't use PCA to reduce the dimensionality of the vectors. I noticed you mentioned that you've performed a dimension reduction.
But when I actually used this, I ran into problems. I didn't use BLINK's approach; instead, I normalized the vectors with L2 normalization. With that, I successfully trained the faiss index and got correct search results:
```python
import faiss

# "L2norm" normalizes vectors inside the index; on unit vectors, L2 distance
# is a monotone function of cosine similarity, so the ranking is preserved
index = faiss.index_factory(768, "L2norm,IVF16384_HNSW32,Flat")
index.train(embeddings.numpy())  # embeddings: torch tensor of candidate encodings
index.add(embeddings.numpy())
```
However, when I applied PCA to reduce the vector dimensionality, I couldn't get correct results, whether I reduced dimensions before normalization or vice versa:
```python
index = faiss.index_factory(768, "PCA256,L2norm,IVF16384_HNSW32,Flat")
```
What I want to ask is: is your method the same as BLINK's? If so, adding new content would require re-indexing the entire KB, because the maximum norm might change. Or do you normalize the vectors, as I do? Where might my approach have gone wrong, such that it produces incorrect search results?
Hi,
I have seen a lot of comments asking how to create a custom dataset, how to use a smaller or different BERT base model for the biencoder, or how to modify certain hyperparameters (such as context length), so I have decided to write a small tutorial covering these topics.
First of all, here's what the data directory structure looks like:
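Something along these lines (an illustrative layout; adapt the names to your setup):

```
data/
├── documents.jsonl
├── train.jsonl
├── valid.jsonl
└── test.jsonl
```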
`documents.jsonl` is the file containing all the candidates. Here's what it looks like:
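Each line is a JSON object, roughly like this (an illustrative record; I'm assuming `document_id`, `title`, and `text` as the field names):

```
{"document_id": 7412236, "title": "Some Entity", "text": "Description of the entity ..."}
```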
`document_id` is some kind of identifier for the document; in my case, it is the Wikipedia `page_id`. For e.g., the 2nd document refers to https://en.wikipedia.org/?curid=7412236.

Here's what the train/test/valid data looks like:
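Along these lines (illustrative; the field names follow BLINK's biencoder data format, one mention per line):

```
{"context_left": "text to the left of the mention", "mention": "the mention itself", "context_right": "text to the right", "label": "text of the gold document", "label_title": "title of the gold document", "label_id": 1}
```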
The `label_id` corresponds to the line number (0-indexed) of the label in `documents.jsonl`. For e.g., you can pull up the document for a given label with `sed` (which uses 1-indexing), as shown below.
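For a mention with `label_id` 1, the matching document is on line 2 of `documents.jsonl` (a hypothetical lookup):

```
sed -n '2p' data/documents.jsonl
```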
You can train a biencoder model as follows:
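Something along these lines (a sketch from memory; double-check the flags against the argument parser used by `blink/biencoder/train_biencoder.py`):

```
PYTHONPATH=. python blink/biencoder/train_biencoder.py \
  --data_path data \
  --output_path models/biencoder \
  --bert_model google/bert_uncased_L-8_H-512_A-8 \
  --train_batch_size 32 \
  --eval_batch_size 32 \
  --learning_rate 1e-05 \
  --num_train_epochs 5
```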
The training script output is pretty self-explanatory, and you should be able to verify that the model is making progress from one evaluation round to the next. I would highly recommend using a subset of the data and more frequent evaluation rounds to confirm that training is progressing well.
Please note, the `main` branch uses an older transformers library called `pytorch-transformers`. In order to use any of the HuggingFace base BERT models (such as `google/bert_uncased_L-8_H-512_A-8` above), you'll have to make minor changes to the BLINK codebase:

- Replace `pytorch-transformers` with `transformers`
- Use `AutoModel` and `AutoTokenizer` instead of `BertModel` and `BertTokenizer`, for e.g. in `biencoder.py` and `ranker_base.py` (see the sketch below)
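A minimal sketch of that swap (illustrative; the exact import sites in the repo may differ):

```python
# before (pytorch-transformers era):
# from pytorch_transformers.modeling_bert import BertModel
# from pytorch_transformers.tokenization_bert import BertTokenizer

# after: the Auto classes accept any HuggingFace BERT checkpoint
from transformers import AutoModel, AutoTokenizer

model_name = "google/bert_uncased_L-8_H-512_A-8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
bert_model = AutoModel.from_pretrained(model_name)
```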
There are minor issues (such as correct placement of data on CPU/GPU devices, freeing up GPU memory periodically, etc.) - please look at open pull requests and search through reported issues to fix those.
In my case, I also had to modify the train/test/valid torch datasets and dataloaders. The ones in the `main` branch load all the data into memory, causing OOM errors. I created my own `IterableDataset` to read data on the fly. If you have a multi-core CPU, use multiple workers to feed data to the model on the GPU.
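Roughly along these lines (a simplified sketch, not the exact code):

```python
import json
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class JsonlDataset(IterableDataset):
    """Streams JSONL records from disk instead of loading them all into memory."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        info = get_worker_info()
        num_workers = info.num_workers if info else 1
        worker_id = info.id if info else 0
        with open(self.path) as f:
            for i, line in enumerate(f):
                # shard lines across workers so each record is yielded exactly once
                if i % num_workers == worker_id:
                    yield json.loads(line)

# multiple workers keep the GPU fed while the CPU parses JSON
loader = DataLoader(JsonlDataset("data/train.jsonl"), batch_size=32, num_workers=4)
```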
I was able to train a much smaller `google/bert_uncased_L-8_H-512_A-8` model instead of `bert-large-uncased` (159MB vs 1.25GB) on my custom dataset, on a much smaller, older GPU (an Nvidia GeForce GTX 1060 with 6GB of GPU memory).

After creating a FAISS index of the candidate encodings with dimensionality reduction (512 => 384, via PCA or OPQ) and coarse- and fine-grained product quantization, I am able to run the model relatively quickly on CPU with good accuracy.
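For reference, an index along those lines can be built with a faiss factory string; the numbers below are illustrative, not the exact configuration:

```python
import faiss

# OPQ rotation with reduction to 384 dims, an IVF coarse quantizer,
# and 64-byte PQ codes for the fine-grained quantization
index = faiss.index_factory(512, "OPQ64_384,IVF4096,PQ64")
index.train(candidate_vecs)  # candidate_vecs: float32 array, shape (n, 512)
index.add(candidate_vecs)
faiss.extract_index_ivf(index).nprobe = 16  # IVF lists to visit per query
D, I = index.search(query_vecs, 10)
```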
Thanks FB research team for the great effort!