facebookresearch / LAMA

LAnguage Model Analysis

Dataset size mismatch #29

Closed by jhyuklee 4 years ago

jhyuklee commented 4 years ago

Hi, thank you for open-sourcing this great project!

I looked into the datasets provided in this repository (https://dl.fbaipublicfiles.com/LAMA/data.zip), and some of their sizes do not match the sizes reported in the paper:

ConceptNet: 11458 (paper) vs 29774 (dataset)

Google-RE death-place: 765 (paper) vs 766 (dataset)

Also, for the TREx dataset, could you explain how the sentences are selected from the 'evidences' in each line of the jsonl file? There seem to be multiple 'masked_sentence' entries in 'evidences'.
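For reference, here is a minimal sketch of how the two counts in question (records per file and masked sentences per record) could be measured. It assumes only what is described above: one JSON object per line, each with an 'evidences' list whose items may carry a 'masked_sentence' field. The helper name and the sample records are made up for illustration and are not taken from the repository.

```python
import json
from typing import Iterable, Tuple


def count_records_and_evidences(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of records, total number of masked sentences)."""
    n_records = 0
    n_masked = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        n_records += 1
        # 'evidences' and 'masked_sentence' are the field names mentioned
        # above; a record without them contributes zero sentences.
        for evidence in record.get("evidences", []):
            if "masked_sentence" in evidence:
                n_masked += 1
    return n_records, n_masked


# Tiny in-memory stand-in for a real jsonl file (contents are made up).
sample = [
    '{"evidences": [{"masked_sentence": "A was born in [MASK]."},'
    ' {"masked_sentence": "[MASK] is where A was born."}]}',
    '{"evidences": [{"masked_sentence": "B died in [MASK]."}]}',
]
print(count_records_and_evidences(sample))
```

For a file on disk, the same function can be given an open file handle, since iterating over a text file yields its lines.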

Thank you.

fabiopetroni commented 4 years ago

Hey @jhyuklee,

When you run run_experiments.py, some of the datapoints will be filtered out. Look here: https://github.com/facebookresearch/LAMA/blob/master/scripts/batch_eval_KB_completion.py#L223. You should get the same number of datapoints as we report in the paper once you run the script.
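To illustrate why a raw file can hold more datapoints than an evaluation reports, here is a rough sketch of a pre-evaluation filter. It is not the code from the linked script. The two criteria (answer must be in a vocabulary, sentence must not exceed a length limit), the field name `obj_label`, and the function name are all assumptions chosen for illustration; the real criteria are whatever the script applies.

```python
from typing import Dict, List, Set


def filter_samples(
    samples: List[Dict[str, str]],
    vocab: Set[str],
    max_tokens: int,
) -> List[Dict[str, str]]:
    """Keep only samples that pass two hypothetical checks."""
    kept = []
    for sample in samples:
        # Hypothetical check 1: the answer must be a known vocabulary item.
        if sample["obj_label"] not in vocab:
            continue
        # Hypothetical check 2: the whitespace-tokenized sentence must fit
        # within the length limit.
        if len(sample["masked_sentence"].split()) > max_tokens:
            continue
        kept.append(sample)
    return kept


# Made-up datapoints, vocabulary, and length limit.
samples = [
    {"masked_sentence": "A was born in [MASK] .", "obj_label": "Paris"},
    {"masked_sentence": "B died in [MASK] .", "obj_label": "Saint-Jean-de-Luz"},
    {
        "masked_sentence": "C , who lived for many years abroad , died in [MASK] .",
        "obj_label": "London",
    },
]
kept = filter_samples(samples, vocab={"Paris", "London"}, max_tokens=8)
print(len(samples), len(kept))
```

Comparing the count before and after such a filter is one way to check whether a size difference comes from filtering or from the data files themselves.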

Ciao, Fabio