OpenBMB / ToolBench

[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning.
https://openbmb.github.io/ToolBench/
Apache License 2.0

retriever corpus embedding cache #235

Open biosfood opened 8 months ago

biosfood commented 8 months ago

When using the inference pipeline, the retriever corpus needs to be rebuilt every time, which makes debugging very time-consuming. Instead, there should be an option to use a precomputed set of embeddings.

I will gladly create a PR if you find these changes appropriate.

Usage might look something like this:

--corpus_tsv_path data/retrieval/G1/corpus.tsv
--corpus_cache_path data/retrieval/G1/corpus_cache/

Then, the program can compute a hash of the corpus.tsv and look in the corpus_cache directory: if a matching file exists, load the embeddings from it; otherwise, compute the embeddings as before and store them there.

It is appropriate to load the corpus.tsv into memory in the first place, because we also need access to the plain text, meaning no real "excess work" is being done when loading it and looking up the cache file.

I got this to work by storing just the corpus_embeddings tensor as a .pt file:

torch.save(corpus_embeddings, "corpus_embeddings.pt")
corpus_embeddings = torch.load("corpus_embeddings.pt")