Closed Dundalia closed 1 year ago
Hi colleagues. Embedding the BeIR benchmark corpus with dpr_scale/generate_embeddings.py fails with an error because some "text" fields in scidocs are NaN. I fixed it by simply replacing the NaN values with empty strings:
import pandas as pd
import numpy as np

BEIR_FOLDER = "/home/davide/DRAGON/dpr-scale/beir/"

scidocs_path = BEIR_FOLDER + "scidocs/collection.tsv"
scidocs = pd.read_csv(scidocs_path, sep="\t")

# Replace every non-string value (such as NaN) in "text" and "title" with an empty string
scidocs.loc[~scidocs.text.apply(lambda x: isinstance(x, str)), "text"] = ""
scidocs.loc[~scidocs.title.apply(lambda x: isinstance(x, str)), "title"] = ""

# Overwrite the original collection file with the cleaned version
scidocs.to_csv(scidocs_path, sep="\t", index=False)
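For anyone who wants to try the fix without touching the real file first, here is a minimal sketch of the same idea using `fillna`. The helper name `fill_missing_text`, the `id` column, and the sample rows are made up for illustration; only the `title` and `text` column names come from the snippet above. Note that `fillna("")` only replaces missing values, whereas the `isinstance` check above also blanks any other non-string value, so the two are not strictly equivalent.

```python
import io

import pandas as pd


def fill_missing_text(df, columns=("title", "text")):
    """Return a copy of df with missing values in `columns` replaced by ""."""
    out = df.copy()
    for col in columns:
        if col in out.columns:
            out[col] = out[col].fillna("")
    return out


# Tiny in-memory stand-in for collection.tsv (hypothetical rows, not real scidocs data).
sample_tsv = (
    "id\ttitle\ttext\n"
    "0\tA title\tSome text\n"
    "1\t\tOnly text\n"
    "2\tOnly title\t\n"
)

# By default pandas parses empty fields as NaN, which reproduces the problem.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
cleaned = fill_missing_text(df)

print(cleaned)
```

One caveat: an empty string written by `to_csv` is read back as NaN by `pd.read_csv` with default settings, so any later step that loads the cleaned file with pandas would need `keep_default_na=False` (or its own `fillna`) to keep the empty strings.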