facebookresearch / contriever

Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning

Host Wikipedia Keys like DPR? #2

Closed · JulesGM closed this issue 2 years ago

JulesGM commented 2 years ago

Hello folks, I appreciate this work quite a bit; congrats on the new state of the art in zero-shot retrieval.

I feel like something very helpful that DPR did for researchers in labs with less per-researcher compute was hosting the key embeddings (and the FAISS index). This benefits everyone (including me, as an M.Sc. student) and promotes research and reproducibility in the field.

Since both your team and the DPR team are part of Facebook Research, this is likely possible for you folks as well. Just wondering :) Thanks.

gizacard commented 2 years ago

Hi,

Wikipedia embeddings, computed on the same Wikipedia passage split used in DPR, are now available for the different models:

https://dl.fbaipublicfiles.com/contriever/embeddings/contriever/wikipedia_embeddings.tar
https://dl.fbaipublicfiles.com/contriever/embeddings/contriever-msmarco/wikipedia_embeddings.tar
https://dl.fbaipublicfiles.com/contriever/embeddings/mcontriever/wikipedia_embeddings.tar
https://dl.fbaipublicfiles.com/contriever/embeddings/mcontriever-msmarco/wikipedia_embeddings.tar
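
In case it's useful, a minimal download-and-extract sketch using only the Python standard library (this assumes each archive unpacks into a directory of passages_* shard files; the archives are tens of GB, so adjust paths and the model variant to your needs):

import tarfile
import urllib.request

## One of the archives above; swap in the model variant you want.
url = 'https://dl.fbaipublicfiles.com/contriever/embeddings/contriever/wikipedia_embeddings.tar'
tar_path = 'wikipedia_embeddings.tar'

## The download is tens of GB, so this can take a while.
urllib.request.urlretrieve(url, tar_path)

## Extract the passage-embedding shards (passages_00, passages_01, ...).
with tarfile.open(tar_path) as tar:
    tar.extractall(path='.')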

Gautier

adivekar-utexas commented 1 year ago

@gizacard Thanks for sharing these!

I downloaded and extracted the first .tar file, resulting in:

total 31G
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_11
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_06
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_10
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_07
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_02
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_15
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_03
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_14
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_05
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_12
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_04
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_13
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_09
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_01
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_00
drwxrwxr-x 2 adivekar adivekar 6.0K May 30  2022 .
-rw-rw-r-- 1 adivekar adivekar 1.9G May 30  2022 passages_08
drwxrwxr-x 3 adivekar adivekar 6.0K Mar  5 15:31 ..

How are you supposed to use these?

adivekar-utexas commented 1 year ago

The following fails:

from src.index import Indexer  ## assuming the Indexer class in this repo's src/index.py

indexer = Indexer(768)  ## 768 = BERT hidden size, since Contriever is BERT-based
indexer.deserialize_from('./wikipedia_embeddings/')

Error:

Loading index from ./wikipedia_embeddings/index.faiss, meta data from ./wikipedia_embeddings/index_meta.faiss
RuntimeError: Error in faiss::FileIOReader::FileIOReader(const char*) at /home/conda/feedstock_root/build_artifacts/faiss-split_1644327811086/work/faiss/impl/io.cpp:67: Error: 'f' failed: could not open ./wikipedia_embeddings/index.faiss for reading: No such file or directory

Seems like the index.faiss file is missing?

adivekar-utexas commented 1 year ago

Okay, I think I got it (through random guessing). Each passages_* shard is a pickled (ids, embeddings) tuple, which np.load() with allow_pickle=True loads just fine.

Below is a loading script which can be invoked as follows:

contriever_wiki_embeddings: Dict[str, np.ndarray] = load_contriever_wiki_embeddings(
    '~/contriever-wikipedia_embeddings/wikipedia_embeddings/'   ## CHANGE THIS TO YOUR PATH TO THE CORRESPONDING FOLDER
)

Loading script:

from typing import Any, Dict, List, Set, Tuple, Union
from concurrent.futures import Future, ThreadPoolExecutor
import math, os, glob, numpy as np

def get_result(x):
    """Return the result of a Future, or the value itself if it is not a Future."""
    if isinstance(x, Future):
        return x.result()
    return x

def accumulate(futures: Union[Tuple, List, Set, Dict, Any]) -> Union[Tuple, List, Set, Dict, Any]:
    """Join operation on a single future or a collection of futures."""
    if isinstance(futures, list):
        return [get_result(future) for future in futures]
    elif isinstance(futures, tuple):
        return tuple(get_result(future) for future in futures)
    elif isinstance(futures, set):
        return {get_result(future) for future in futures}
    elif isinstance(futures, dict):
        return {k: get_result(v) for k, v in futures.items()}
    else:
        return get_result(futures)

def num_zeros_to_pad(max_i: int) -> int:
    """Number of digits needed to zero-pad indices up to `max_i` (unused below, kept for convenience)."""
    assert isinstance(max_i, int) and max_i >= 1
    num_zeros = math.ceil(math.log10(max_i))  ## Ref: https://stackoverflow.com/a/51837162/4900327
    if max_i == 10 ** num_zeros:  ## If it is a power of 10
        num_zeros += 1
    return num_zeros

def pad_zeros(i: int, max_i: int = None) -> str:
    """Zero-pad `i` to the width needed for `max_i`, e.g. pad_zeros(3, 16) -> '03' (unused below)."""
    assert isinstance(i, int) and i >= 0
    if max_i is None:
        return str(i)
    assert isinstance(max_i, int) and max_i >= i
    num_zeros: int = num_zeros_to_pad(max_i)
    return f'{i:0{num_zeros}}'

def load_embeddings(passages_fpath: str) -> Dict[str, np.ndarray]:
    """Load one passages_* shard: a pickled (passage_ids, embeddings) tuple, readable via np.load with allow_pickle=True."""
    npzdata: Tuple[List[str], np.ndarray] = np.load(
        passages_fpath,
        allow_pickle=True,
    )
    assert isinstance(npzdata, tuple) and len(npzdata) == 2, f'Expected 2-tuple, found: {type(npzdata)} with len {len(npzdata)}'
    passage_embeddings: Dict[str, np.ndarray] = {
        passage_i_str: passage_embedding
        for passage_i_str, passage_embedding in zip(npzdata[0], list(npzdata[1]))
    }
    return passage_embeddings

def load_contriever_wiki_embeddings(
    passages_dir_path: str,
    max_num_files: int = int(1e9),
    file_glob: str = 'passages_*',
    num_threads: int = 127,
) -> Dict[str, np.ndarray]:
    """Load all passage-embedding shards in `passages_dir_path` in parallel and merge them into one dict."""
    passages_dir_path: str = os.path.expanduser(passages_dir_path)  ## glob does not expand '~' on its own
    passages_fpaths: List[str] = sorted(glob.glob(os.path.join(passages_dir_path, file_glob)))
    passages_fpaths: List[str] = passages_fpaths[:max_num_files]  ## Load only `max_num_files`
    num_threads: int = max(1, min(num_threads, len(passages_fpaths)))  ## No point in more threads than files

    passage_embeddings: Dict[str, np.ndarray] = {}
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures: Dict[str, Future] = {
            passages_fpath: executor.submit(load_embeddings, passages_fpath=passages_fpath)
            for passages_fpath in passages_fpaths
        }
        for passages_fpath, fut in futures.items():
            passage_embeddings.update(accumulate(fut))
            print(f'Completed: {len(passage_embeddings)/1e6:.3f}MM passages')
    return passage_embeddings
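
For what it's worth, once the shards are loaded you can put them behind a flat inner-product FAISS index yourself (as far as I understand, Contriever scores query/passage pairs by the dot product of mean-pooled embeddings). A minimal sketch, assuming the contriever_wiki_embeddings dict from the invocation above and a precomputed (1, 768) float32 query embedding; note that the full ~21M-passage dump needs a lot of RAM as float32, so a subset or an IVF/PQ index may be more practical:

import faiss
import numpy as np

## Keys and vectors from the dict loaded above; FAISS wants a contiguous float32 matrix.
passage_ids = list(contriever_wiki_embeddings.keys())
embedding_matrix = np.stack([contriever_wiki_embeddings[pid] for pid in passage_ids]).astype(np.float32)

index = faiss.IndexFlatIP(embedding_matrix.shape[1])  ## inner-product (dot-product) similarity
index.add(embedding_matrix)

## query_embedding: a (1, 768) float32 array from the Contriever query encoder (not shown here).
## scores, idxs = index.search(query_embedding, 10)
## top_passage_ids = [passage_ids[i] for i in idxs[0]]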

adivekar-utexas commented 1 year ago

@gizacard Follow-up question: where are the original passages that were used to compute these embeddings?

adivekar-utexas commented 1 year ago

Answering my own question above: the HuggingFace datasets library has them: https://huggingface.co/datasets/wiki_dpr

from datasets import load_dataset
wiki_passages = load_dataset('wiki_dpr', 'psgs_w100.multiset.exact.no_embeddings')
wiki_passages['train']   ## This has an "id" column; match its values to the keys of the `passage_embeddings` dict above
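
To tie the two together, a small sketch (assuming the embedding keys are the same passage ids as wiki_dpr's "id" column, compared as strings) for looking up the title and text of a passage returned by the index:

## Hypothetical passage id; in practice this would come from a FAISS search result.
some_passage_id = next(iter(contriever_wiki_embeddings))

## Build an id -> row mapping once (the dump has ~21M passages, so this takes a while),
## then fetch the title and text for that passage.
id_to_row = {str(pid): i for i, pid in enumerate(wiki_passages['train']['id'])}
row = wiki_passages['train'][id_to_row[str(some_passage_id)]]
print(row['title'], row['text'][:200])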