facebookresearch / dpr-scale

Scalable training for dense retrieval models.

scidocs seems to be badly formatted #14

Closed Dundalia closed 1 year ago

Dundalia commented 1 year ago

Hi colleagues. When I try to embed the corpora from the BeIR benchmark with dpr_scale/generate_embeddings.py, it fails on scidocs because some of its "text" fields are NaN. I fixed it by replacing every non-string value in the "text" and "title" columns with an empty string:

import pandas as pd

BEIR_FOLDER = "/home/davide/DRAGON/dpr-scale/beir/"

# Load the scidocs collection (tab-separated)
scidocs_path = BEIR_FOLDER + "scidocs/collection.tsv"
scidocs = pd.read_csv(scidocs_path, sep="\t")

# Replace every non-string value (e.g. NaN) in "text" and "title" with ""
scidocs.loc[~scidocs.text.apply(lambda x: isinstance(x, str)), "text"] = ""
scidocs.loc[~scidocs.title.apply(lambda x: isinstance(x, str)), "title"] = ""

# Overwrite the original file with the cleaned collection
scidocs.to_csv(scidocs_path, sep="\t", index=False)
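
For anyone who wants to check the fix before overwriting the file, here is a minimal sketch on a tiny in-memory table. The sample rows, the "id" column and the helper name fill_missing_text are made up for illustration; only the "title" and "text" column names come from the snippet above. It is a variation on the same idea and not something dpr-scale provides.

```python
import io

import pandas as pd


def fill_missing_text(df, columns=("title", "text")):
    """Return a copy of df with missing values in `columns` replaced by ""."""
    out = df.copy()
    for col in columns:
        # fillna only touches missing values; astype(str) keeps other values as text
        out[col] = out[col].fillna("").astype(str)
    return out


# Hypothetical stand-in for collection.tsv: row 1 has no text, row 2 has no title.
sample_tsv = (
    "id\ttitle\ttext\n"
    "0\tA title\tSome text\n"
    "1\tNo abstract\t\n"
    "2\t\tOnly text\n"
)

# With default settings, pandas parses the empty fields as NaN.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
missing_before = int(sample[["title", "text"]].isna().sum().sum())

cleaned = fill_missing_text(sample)
missing_after = int(cleaned[["title", "text"]].isna().sum().sum())

# Alternative: avoid creating NaN at all by reading every column as a string.
reloaded = pd.read_csv(
    io.StringIO(sample_tsv), sep="\t", dtype=str, keep_default_na=False
)

print(f"missing before: {missing_before}, after: {missing_after}")
```

One difference from my snippet above: the isinstance check blanks out any value that is not a string, while fillna only replaces missing values and astype(str) keeps anything else as text. I have not checked whether that matters for scidocs.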