scidocs seems to be badly formatted

Hi colleagues. When I try to embed the corpus from the BeIR benchmark with dpr_scale/generate_embeddings.py, it fails with an error because some "text" fields of scidocs are NaN. I fixed it by simply replacing the NaN values with empty strings:

import pandas as pd
import numpy as np

BEIR_FOLDER = "/home/davide/DRAGON/dpr-scale/beir/"

scidocs_path = BEIR_FOLDER + "scidocs/collection.tsv"
scidocs = pd.read_csv(scidocs_path, sep="\t")

scidocs.loc[~scidocs.text.apply(lambda x: isinstance(x, str)), "text"] = ""
scidocs.loc[~scidocs.title.apply(lambda x: isinstance(x, str)), "title"] = ""

scidocs.to_csv(scidocs_path, sep="\t", index=False)
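
For anyone who wants to check the fix before touching the real file, here is a minimal sketch of the same idea on a small in-memory sample. It uses `fillna("")` instead of the `isinstance` test, which is only equivalent if NaN is the sole non-string value in the two columns. The sample rows, the `id` column name, and the helper `fill_missing_text` are made up for illustration and are not part of dpr-scale.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for scidocs/collection.tsv.
# The "id" column name is an assumption; "text" and "title" come from the post.
sample_tsv = (
    "id\ttext\ttitle\n"
    "0\tsome abstract\tA title\n"
    "1\t\tOnly a title\n"   # empty "text" field, parsed as NaN by read_csv
    "2\tonly text\t\n"      # empty "title" field, parsed as NaN by read_csv
)


def fill_missing_text(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaN in 'text' and 'title' replaced by empty strings."""
    out = df.copy()
    out[["text", "title"]] = out[["text", "title"]].fillna("")
    return out


raw = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
fixed = fill_missing_text(raw)

# Count the missing values before and after the fix.
print(raw[["text", "title"]].isna().sum())
print(fixed[["text", "title"]].isna().sum())
```

One thing to keep in mind: empty strings are written to the TSV as empty fields, so reading the file again with `pd.read_csv` turns them back into NaN unless `keep_default_na=False` is passed.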

facebookresearch/dpr-scale, issue #14