facebookresearch / KILT

Library for Knowledge Intensive Language Tasks
MIT License

ELI5 KILT annotation #44

Closed carriex closed 3 years ago

carriex commented 3 years ago

Hi, thanks for creating the dataset! I have two questions regarding the annotated ELI5 data:

Thank you!

fabiopetroni commented 3 years ago

Hi @carriex,

thanks a lot for your message. The meta field in the ELI5 dev set contains exhaustive information about our annotation campaign, including the span of text that was highlighted by the annotators. Here are the full guidelines for the annotation campaign:

I hope this helps!
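
For a concrete starting point, here is a minimal sketch of how that meta field can be inspected. The file name eli5-dev-kilt.jsonl and the exact nesting of the fields are assumptions based on the examples quoted later in this thread, not code from the KILT repository:

    # Minimal sketch: peek at the annotation metadata of the ELI5 dev set.
    # Assumes the dev set was downloaded as KILT-style JSONL; the field
    # nesting follows the examples shown later in this thread.
    import json

    with open("eli5-dev-kilt.jsonl") as f:
        for line in f:
            example = json.loads(line)
            partial = example.get("meta", {}).get("partial_evidence", {})
            # Pages annotators marked as partial evidence, plus the text
            # spans they highlighted during the annotation campaign.
            print(example["id"], partial.get("wikipedia_id"), partial.get("title"))
            for m in partial.get("meta", []):
                print("  highlighted:", m.get("evidence_span"))
            break  # look at the first record only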

carriex commented 3 years ago

Hi @fabiopetroni,

thanks for providing the detailed explanation of how the annotation campaign was carried out! I have two follow-up clarification questions:

  1. For the field containing exhaustive information about the annotation campaign, are you referring to the meta field in the example below, which includes partial evidence? It also looks like the partial evidence comes from a different wikipedia page than the one in output/provenance. However, from eval_retrieval.py it looks like the wikipedia page in output/provenance is still the one used to evaluate retrieval performance (see the sketch after the second example below). Is this understanding correct?

    {'id': '1kiwfx',
     'input': 'In Trading Places (1983, Akroyd/Murphy) how does the scheme at the end of the movie work? Why would buying a lot of OJ at a high price ruin the Duke Brothers?',
     'meta': {
         'left_context': '',
         'mention': '',
         'obj_surface': {'text': array([], dtype=object)},
         'partial_evidence': {
             'end_paragraph_id': array([7], dtype=int32),
             'meta': array([{'evidence_span': array([
                 'On television, they learn that Clarence Beeks is transporting a secret USDA report on orange crop forecasts.',
                 'On television, they learn that Clarence Beeks is transporting a secret USDA report on orange crop forecasts. Winthorpe and Valentine recall large payments made to Beeks by the Dukes and realize that the Dukes plan to obtain the report to corner the market on frozen orange juice.',
                 'Winthorpe and Valentine recall large payments made to Beeks by the Dukes and realize that the Dukes plan to obtain the report to corner the market on frozen orange juice.'], dtype=object)}], dtype=object),
             'section': array(['Section::::Plot.\n'], dtype=object),
             'start_paragraph_id': array([7], dtype=int32),
             'title': array(['Trading Places'], dtype=object),
             'wikipedia_id': array(['520990'], dtype=object)},
         'right_context': '',
         'sub_surface': {'text': array([], dtype=object)},
         'subj_aliases': {'text': array([], dtype=object)},
         'template_questions': {'text': array([], dtype=object)}},
     'output': {
         'answer': array(['The final scene involves future contracts. ..."what happens at the end of Trading Places?"', ''], dtype=object),
         'meta': array([], dtype=object),
         'provenance': array([{
             'bleu_score': array([0.92328084], dtype=float32),
             'end_character': array([612], dtype=int32),
             'end_paragraph_id': array([1], dtype=int32),
             'meta': array([], dtype=object),
             'section': array(['Section::::Abstract.'], dtype=object),
             'start_character': array([14], dtype=int32),
             'start_paragraph_id': array([1], dtype=int32),
             'title': array(['Futures contract'], dtype=object),
             'wikipedia_id': array(['242855'], dtype=object)}], dtype=object)}}

  2. My original question is actually about the meta field inside output/provenance. For example, the instance below contains such a field, whereas the annotation in the example above only contains start/end character and paragraph offsets into the wikipedia page. I'm wondering what the difference is between these two kinds of annotations?

    {'id': '3atjp2',
     'input': 'what are benefits of TPP ?',
     'meta': {
         'left_context': '',
         'mention': '',
         'obj_surface': {'text': array([], dtype=object)},
         'partial_evidence': {
             'end_paragraph_id': array([], dtype=int32),
             'meta': array([], dtype=object),
             'section': array([], dtype=object),
             'start_paragraph_id': array([], dtype=int32),
             'title': array([], dtype=object),
             'wikipedia_id': array([], dtype=object)},
         'right_context': '',
         'sub_surface': {'text': array([], dtype=object)},
         'subj_aliases': {'text': array([], dtype=object)},
         'template_questions': {'text': array([], dtype=object)}},
     'output': {
         'answer': array(["The TPP is a trade liberalization treaty...why would FR/UK/NZ etc. want to sign it France and the UK are not part of TPP. That's TTIP, a similar but separate deal.", ''], dtype=object),
         'meta': array([], dtype=object),
         'provenance': array([{
             'bleu_score': array([0.], dtype=float32),
             'end_character': array([-1], dtype=int32),
             'end_paragraph_id': array([1], dtype=int32),
             'meta': array([{
                 'annotation_id': '-1',
                 'evidence_span': {'text': array([
                     'Theory of Motivated Information Management or TMIM, is a social-psychological framework that examines the relationship between information management and uncertainty. The theory posits that individuals are motivated to manage their uncertainty levels when they perceive a discrepancy between the level of uncertainty they have about an important issue and the level of uncertainty they want. In other words, someone may be uncertain about an important issue but decides not to engage or seek information because they are comfortable with that state.\rhighlight sentence(s) containing evidence, not only the answer',
                     'Theory of Motivated Information Management or TMIM, is a social-psychological framework that examines the relationship between information management and uncertainty. The theory posits that individuals are motivated to manage their uncertainty levels when they perceive a discrepancy between the level of uncertainty they have about an important issue and the level of uncertainty they want. In other words, someone may be uncertain about an important issue but decides not to engage or seek information because they are comfortable with that state.'], dtype=object)},
                 'fever_page_id': '',
                 'fever_sentence_id': -1,
                 'yes_no_answer': ''}], dtype=object),
             'section': array(['Section::::Abstract.'], dtype=object),
             'start_character': array([-1], dtype=int32),
             'start_paragraph_id': array([1], dtype=int32),
             'title': array(['Theory of Motivated Information Management'], dtype=object),
             'wikipedia_id': array(['36119336'], dtype=object)}], dtype=object)}}
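
To make my understanding in question 1 concrete, here is a rough sketch of what I believe is happening, i.e. that the gold pages for retrieval come only from output/provenance while meta/partial_evidence is ignored. This is not the actual eval_retrieval.py; the function name and the ranked retrieved_wikipedia_ids input are hypothetical, and it assumes the raw KILT JSONL layout where output is a list of dicts:

    def precision_at_1(gold_example, retrieved_wikipedia_ids):
        """retrieved_wikipedia_ids: ranked page ids from the retriever under test."""
        # Gold pages are taken from output/provenance only; pages that appear
        # just as partial evidence under meta do not count.
        gold_ids = {
            prov["wikipedia_id"]
            for out in gold_example.get("output", [])
            for prov in out.get("provenance", [])
        }
        if not gold_ids or not retrieved_wikipedia_ids:
            return 0.0
        return 1.0 if retrieved_wikipedia_ids[0] in gold_ids else 0.0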

Again, thanks very much for your help!

fabiopetroni commented 3 years ago
  1. We don't consider partial evidence in KILT. A wikipedia page is added to the output only if the majority of annotators mark it as containing full evidence, so there might be other pages with only partial evidence in meta (see the sketch after this list).
  2. I don't fully get the question. In meta we report the complete annotation information, including the partial evidence annotation and the evidence span when available (even though we ignore this information in the evaluation).
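
As a toy illustration of point 1 (the per-annotator labels and the helper below are hypothetical, not the actual annotation aggregation code):

    from collections import Counter

    def promote_to_provenance(annotator_labels):
        """annotator_labels: hypothetical judgements for one (question, page)
        pair, e.g. ["full", "full", "partial"]."""
        # The page is added to output/provenance only if a strict majority of
        # annotators marked it as containing full evidence; pages with only
        # partial-evidence judgements remain in meta.
        return Counter(annotator_labels)["full"] > len(annotator_labels) / 2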
carriex commented 3 years ago

thanks for getting back so quickly! The answer to the first question makes sense to me.

For the second question, I'm referring to the fields inside output/provenance. For the two examples above, the question with id 1kiwfx doesn't contain an evidence span in output/provenance, only start_character, end_character, start_paragraph_id and end_paragraph_id. However, for the question with id 3atjp2 there are two evidence spans inside output/provenance, while start_character and end_character contain the value -1. Does this mean the annotator highlighted an evidence span for 3atjp2, but only selected "Yes, sufficient to answer" for the passage in 1kiwfx (without highlighting any evidence span)? Sorry for any confusion caused, let me know if this is clear to you!

fabiopetroni commented 3 years ago

I see. So sometimes the evidence span is given as character offsets into a paragraph of the knowledge source, and sometimes as a raw string (probably the automatic script failed in those cases). I hope this helps :)
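
A rough sketch of how one might recover the evidence text in both cases; get_paragraph is a hypothetical helper returning the text of (wikipedia_id, paragraph_id) from the KILT knowledge source, and the field nesting follows the two dumps quoted above (scalar offsets assumed, as in the raw JSONL):

    def evidence_text(prov, get_paragraph):
        """prov: one entry of output/provenance."""
        start, end = prov["start_character"], prov["end_character"]
        if start >= 0 and end >= 0:
            # Case 1 (e.g. 1kiwfx): the span is stored as character offsets
            # into a paragraph of the gold page (a single paragraph assumed).
            paragraph = get_paragraph(prov["wikipedia_id"], prov["start_paragraph_id"])
            return paragraph[start:end]
        # Case 2 (e.g. 3atjp2): offsets are -1, so fall back to the literal
        # evidence_span strings stored under the provenance meta.
        spans = []
        for m in prov.get("meta", []):
            spans.extend(m.get("evidence_span", {}).get("text", []))
        return " ".join(spans)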

carriex commented 3 years ago

@fabiopetroni got it! thank you so much for your help! :)