facebookresearch / LAMA

LAnguage Model Analysis

Sentence selection for T-REx and GoogleRE #31

Open uyaseen opened 4 years ago

uyaseen commented 4 years ago

Hi there,

How were the sentences (evidences for T-REx and considered_sentences for GoogleRE) selected for T-REx and GoogleRE? Was it done manually or using some script?

Thanks,

leo-liuzy commented 4 years ago

Yeah, I have the same question.

fabiopetroni commented 4 years ago

Hi,

Each datapoint in the original datasets (GoogleRE and T-REx) comes with that information: a sentence in which the fact is expressed.

leo-liuzy commented 4 years ago

My observation is that the samples in the T-REx .jsonl files have no "masked_sentences" field, but each item in "evidences" has a "masked_sentence" field. An example looks like this:

{"uuid": "40a3a9c4-f69a-4a49-ae5b-81b423ea5888",
 "obj_uri": "Q38",
 "obj_label": "Italy",
 "sub_uri": "Q1083561",
 "sub_label": "Soppressata",
 "predicate_id": "P495",
 "evidences": [
   {"sub_surface": "soppressata",
    "obj_surface": "Italy",
    "masked_sentence": "It is quite possible to find different preparations of saltwater fish and traditional southern cured meats (like soppressata or 'nduja) in the south of [MASK], whereas in northern Italy it will contain different kinds of cured meats and mushrooms and, especially near lakes, preparations of freshwater fish."},
   {"sub_surface": "soppressata",
    "obj_surface": "Italy",
    "masked_sentence": "It is quite possible to find different preparations of saltwater fish and traditional southern cured meats (like soppressata or 'nduja) in the south of Italy, whereas in northern [MASK] it will contain different kinds of cured meats and mushrooms and, especially near lakes, preparations of freshwater fish."}]}

Do we need to create a field called "masked_sentences" that collects all the "masked_sentence" values, since the LAMA code does not handle this case?

jivatneet commented 3 years ago

I'm facing the same issue. @fabiopetroni could you help? Did the T-REx dataset version change? It has no "masked_sentences" field, while all the other datasets do. Thank you so much.

fabiopetroni commented 3 years ago

Hey,

thanks for your message. The masked_sentence field is not used in the LAMA evaluation (that would be way too easy for an LM; see the appendix of the paper). We used templates! Have a look here: https://github.com/facebookresearch/LAMA/blob/master/scripts/run_experiments.py#L160 Good luck

Ciao, Fabio
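
For readers following along, the template idea can be sketched roughly as below. This is only an illustration, not the code from run_experiments.py: the helper name fill_template and the example template string are assumptions, and the only thing taken from the thread is that a relation template, rather than the evidence sentence, produces the masked input.

```python
MASK = "[MASK]"


def fill_template(template, sub_label, obj_label=MASK):
    """Replace the [X] and [Y] placeholders of a relation template.

    Hypothetical helper: [X] is assumed to stand for the subject and
    [Y] for the object, which is masked by default.
    """
    return template.replace("[X]", sub_label).replace("[Y]", obj_label)


# Sample fields copied from the T-REx example earlier in the thread.
sample = {"sub_label": "Soppressata", "obj_label": "Italy", "predicate_id": "P495"}

# Illustrative template for P495; the real template text is an assumption here.
template = "[X] was created in [Y] ."

# The evaluation input is built from the template, not from the evidences.
sample["masked_sentences"] = [fill_template(template, sample["sub_label"])]
print(sample["masked_sentences"][0])
```

With a template, the model sees only the subject and a short fixed pattern, so it cannot rely on the surrounding evidence sentence to guess the object.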

jivatneet commented 3 years ago

Thanks a lot for your reply @fabiopetroni! I understand that the field is initialized here: https://github.com/facebookresearch/LAMA/blob/5cba81b6e55d4c596ce62b0166b1acd429a47f28/scripts/batch_eval_KB_completion.py#L421 For an uncased model, however, the lowercase function is called before this step: https://github.com/facebookresearch/LAMA/blob/5cba81b6e55d4c596ce62b0166b1acd429a47f28/scripts/batch_eval_KB_completion.py#L206 At that point the field does not exist yet, which is what caused the error.

morioka commented 3 years ago

I faced the same issue. I added code that does the following:

  1. Use the sentences in sample["masked_sentences"] (the original code).
  2. If sample["masked_sentences"] raises a KeyError, use the evidence["masked_sentence"] of each evidence in sample["evidences"] as the sentences instead.
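
The two steps above could look roughly like the following. This is a minimal sketch rather than the actual patch: the function name get_masked_sentences is hypothetical, and it assumes a sample is a plain dict shaped like the T-REx example earlier in the thread. Note that it falls back to evidence sentences, which, per the reply above, is not what the LAMA evaluation itself uses.

```python
def get_masked_sentences(sample):
    """Return the masked sentences for one sample (hypothetical helper).

    Step 1: use sample["masked_sentences"] when the field exists.
    Step 2: on KeyError, collect "masked_sentence" from each evidence.
    """
    try:
        return sample["masked_sentences"]
    except KeyError:
        return [
            evidence["masked_sentence"]
            for evidence in sample.get("evidences", [])
            if "masked_sentence" in evidence
        ]


# A T-REx-style sample without "masked_sentences" (sentences shortened).
trex_sample = {
    "sub_label": "Soppressata",
    "obj_label": "Italy",
    "evidences": [
        {"masked_sentence": "... in the south of [MASK], whereas ..."},
        {"masked_sentence": "... whereas in northern [MASK] it will ..."},
    ],
}

# A sample that already has the field, as in the other datasets.
other_sample = {"masked_sentences": ["Some subject was born in [MASK] ."]}

print(get_masked_sentences(trex_sample))
print(get_masked_sentences(other_sample))
```

Because the fallback only runs when the field is missing, samples from datasets that already provide "masked_sentences" are returned unchanged.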