Closed JulesGM closed 2 years ago
Hi,
Wikipedia embeddings, computed from the Wikipedia dump split into passages as used in DPR, are now available for the different models:
https://dl.fbaipublicfiles.com/contriever/embeddings/contriever/wikipedia_embeddings.tar
https://dl.fbaipublicfiles.com/contriever/embeddings/contriever-msmarco/wikipedia_embeddings.tar
https://dl.fbaipublicfiles.com/contriever/embeddings/mcontriever/wikipedia_embeddings.tar
https://dl.fbaipublicfiles.com/contriever/embeddings/mcontriever-msmarco/wikipedia_embeddings.tar
Gautier
@gizacard Thanks for sharing these!
I downloaded and extracted the first .tar file, resulting in:
total 31G
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_00
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_01
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_02
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_03
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_04
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_05
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_06
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_07
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_08
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_09
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_10
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_11
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_12
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_13
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_14
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30 2022 passages_15
How are you supposed to use these?
The following fails:
indexer = Indexer(768) ## BERT embedding size since Contriever uses BERT
indexer.deserialize_from('./wikipedia_embeddings/')
Error:
Loading index from ./wikipedia_embeddings/index.faiss, meta data from ./wikipedia_embeddings/index_meta.faiss
RuntimeError: Error in faiss::FileIOReader::FileIOReader(const char*) at /home/conda/feedstock_root/build_artifacts/faiss-split_1644327811086/work/faiss/impl/io.cpp:67: Error: 'f' failed: could not open ./wikipedia_embeddings/index.faiss for reading: No such file or directory
It seems like the index.faiss file is missing?
Okay, I think I got it (through random guessing): each file is a pickled (ids, embeddings) tuple, and np.load() with allow_pickle=True can read it.
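For anyone curious why np.load() works on these: numpy falls back to pickle.load for files that are neither .npy nor .npz when allow_pickle=True. A synthetic round-trip in the same (ids, embeddings) format (the shapes here are made up; real shards hold far more passages):

```python
import os
import pickle
import tempfile

import numpy as np

## Fake a shard in the assumed format: a pickled (ids, embeddings) tuple.
ids = ['0', '1', '2']
embeddings = np.zeros((3, 768), dtype=np.float32)
fpath = os.path.join(tempfile.mkdtemp(), 'passages_00')
with open(fpath, 'wb') as f:
    pickle.dump((ids, embeddings), f)

## np.load sees no .npy/.npz magic bytes and falls back to pickle.load.
data = np.load(fpath, allow_pickle=True)
print(type(data).__name__, len(data[0]), data[1].shape)  ## tuple 3 (3, 768)
```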
Below is a loading script which can be invoked as follows:
contriever_wiki_embeddings: Dict[str, np.ndarray] = load_contriever_wiki_embeddings(
'~/contriever-wikipedia_embeddings/wikipedia_embeddings/' ## CHANGE THIS TO YOUR PATH TO THE CORRESPONDING FOLDER
)
Loading script:
from typing import Any, Dict, List, Set, Tuple, Union
from concurrent.futures import ThreadPoolExecutor, Future  ## Future is public here; avoid the private _base module
import math, os, glob, numpy as np
def get_result(x):
    """Return x, first resolving it if it is a Future."""
    if isinstance(x, Future):
        return x.result()
    return x
def accumulate(futures: Union[Tuple, List, Set, Dict, Any]) -> Union[Tuple, List, Set, Dict, Any]:
"""Join operation on a single future or a collection of futures."""
if isinstance(futures, list):
return [get_result(future) for future in futures]
elif isinstance(futures, tuple):
return tuple([get_result(future) for future in futures])
elif isinstance(futures, set):
return set([get_result(future) for future in futures])
elif isinstance(futures, dict):
return {k: get_result(v) for k, v in futures.items()}
else:
return get_result(futures)
def num_zeros_to_pad(max_i: int) -> int:
assert isinstance(max_i, int) and max_i >= 1
num_zeros = math.ceil(math.log10(max_i)) ## Ref: https://stackoverflow.com/a/51837162/4900327
if max_i == 10 ** num_zeros: ## If it is a power of 10
num_zeros += 1
return num_zeros
def pad_zeros(i: int, max_i: int = None) -> str:
assert isinstance(i, int) and i >= 0
if max_i is None:
return str(i)
assert isinstance(max_i, int) and max_i >= i
num_zeros: int = num_zeros_to_pad(max_i)
return f'{i:0{num_zeros}}'
def load_embeddings(passages_fpath: str) -> Dict[str, np.ndarray]:
    ## Each shard is a raw pickle of an (ids, embeddings) tuple; np.load falls
    ## back to pickle.load for such files when allow_pickle=True.
    npzdata: Tuple[List[str], np.ndarray] = np.load(
        passages_fpath,
        allow_pickle=True,
    )
assert isinstance(npzdata, tuple) and len(npzdata) == 2, f'Expected 2-tuple, found: {type(npzdata)} with len {len(npzdata)}'
passage_embeddings: Dict[str, np.ndarray] = {
passage_i_str: passage_embedding
for passage_i_str, passage_embedding in zip(npzdata[0], list(npzdata[1]))
}
return passage_embeddings
def load_contriever_wiki_embeddings(
passages_dir_path: str,
max_num_files: int=int(1e9),
file_glob: str = 'passages_*',
num_threads: int = 127,
) -> Dict[str, np.ndarray]:
    passages_dir_path: str = os.path.expanduser(passages_dir_path)  ## glob does not expand '~' on its own
    passages_fpaths: List[str] = sorted(glob.glob(os.path.join(passages_dir_path, file_glob)))
    passages_fpaths: List[str] = passages_fpaths[:max_num_files]  ## Load only `max_num_files`
    num_threads: int = min(num_threads, max(1, len(passages_fpaths)))  ## No more threads than files
    executor = ThreadPoolExecutor(max_workers=num_threads)
    futures: Dict[str, Future] = {
        passages_fpath: executor.submit(load_embeddings, passages_fpath=passages_fpath)
        for passages_fpath in passages_fpaths
    }
    passage_embeddings: Dict[str, np.ndarray] = {}
    for passages_fpath, fut in futures.items():
        passage_embeddings.update(accumulate(fut))
        print(f'Completed: {len(passage_embeddings)/1e6:.3f}MM passages')
    executor.shutdown(wait=True)
    return passage_embeddings
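If you just want to query the loaded embeddings without rebuilding the missing FAISS index, a brute-force inner-product search over the dict is a workable sketch (Contriever scores passages by dot product; `search` is my name, not part of the repo, and the random vectors below stand in for the real dict):

```python
import numpy as np

def search(query: np.ndarray, passage_embeddings: dict, k: int = 5):
    """Return the k (passage_id, score) pairs with highest inner product."""
    ids = list(passage_embeddings)
    matrix = np.stack([passage_embeddings[i] for i in ids])
    scores = matrix @ query               ## Inner product, as Contriever uses
    top = np.argsort(-scores)[:k]         ## Indices of the k largest scores
    return [(ids[j], float(scores[j])) for j in top]

## Fake embeddings standing in for the loaded dict.
rng = np.random.default_rng(0)
fake = {str(i): rng.normal(size=768).astype(np.float32) for i in range(100)}
query = rng.normal(size=768).astype(np.float32)
results = search(query, fake, k=3)
for pid, score in results:
    print(pid, round(score, 2))
```

For the full 21MM-passage dump this is slow; building a faiss.IndexFlatIP over the stacked matrix gives the same ranking faster.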
@gizacard follow-up question: where are the original passages that were used to produce these embeddings?
Answering my own question above: the HuggingFace datasets library has them: https://huggingface.co/datasets/wiki_dpr
from datasets import load_dataset
wiki_passages = load_dataset('wiki_dpr', 'psgs_w100.multiset.exact.no_embeddings')
wiki_passages['train'] ## This has an "id" column, match them to the keys of the `passage_embeddings` dict above
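The join itself is just a dict lookup on the "id" column. A minimal sketch with toy data standing in for the real objects (`passage_embeddings` and `rows` below are placeholders, not the actual dataset):

```python
import numpy as np

## Toy stand-ins: the embeddings dict from the loader, and wiki_dpr-style rows.
passage_embeddings = {"1": np.ones(4), "2": np.full(4, 2.0)}
rows = [{"id": "1", "text": "First passage."},
        {"id": "2", "text": "Second passage."}]

## Keep only rows whose id has an embedding, preserving row order.
ids = [r["id"] for r in rows if r["id"] in passage_embeddings]
texts = {r["id"]: r["text"] for r in rows}
matrix = np.stack([passage_embeddings[i] for i in ids])
print(matrix.shape)  ## (2, 4)
```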
Hello folks, I appreciate this work quite a bit; congrats on the new state of the art in zero-shot retrieval.
Something very helpful that DPR did for researchers in labs with less per-researcher compute was to host the key embeddings (and the FAISS index). This promotes research and reproducibility in the field for everyone, including M.Sc. students like me.
Since both your team and the DPR team are at Facebook research, it is likely possible for you folks as well. Just wondering :) Thanks.