bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Create dataset loader for SciFact #237

Closed jason-fries closed 2 years ago

jason-fries commented 2 years ago

Adding a Dataset

nbroad1881 commented 2 years ago

self-assign

nbroad1881 commented 2 years ago

This dataset does not quite fit any of the listed tasks. The authors' approach is text classification performed by passing two texts through the model together. This is most similar to the PAIRS schema, but the only task listed under PAIRS is STS, which does not accurately describe this task; the task itself is TXTCLASS.

The two approaches the authors took are:

  1. Given a claim and a sentence, label that sentence as a "rationale" (which may either support or refute the claim).
  2. Given a claim and the concatenation of all supporting and contradicting rationales (i.e., all of the evidence), label the claim with one of {SUPPORTS, REFUTES, NOINFO}.

The PAIRS schema works for both approaches, but the STS task does not. Please advise on which task should be listed for this dataset.
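To make the fit with PAIRS concrete, here is a hypothetical sketch of what one example from each approach would look like in that schema (the field names `text_1` / `text_2` / `label` and the example values are assumptions for illustration, not actual loader output):

```python
# Hypothetical sketch: one example per approach in the PAIRS schema.
# Field names (text_1 / text_2 / label) assumed from task_schemas.md.
approach_1 = {
    "id": "0",
    "text_1": "the claim",
    "text_2": "a candidate sentence from the cited abstract",
    "label": "rationale",  # binary: rationale vs. not a rationale
}
approach_2 = {
    "id": "1",
    "text_1": "the claim",
    "text_2": "all rationale sentences concatenated together",
    "label": "SUPPORTS",  # one of SUPPORTS / REFUTES / NOINFO
}
```

Both approaches produce the same structure; only the meaning of `text_2` and the label set differ, which is why PAIRS fits but STS does not.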

galtay commented 2 years ago

This appears to be an entailment task to me. Do you think it would fit into this schema? https://github.com/bigscience-workshop/biomedical/blob/master/task_schemas.md#textual-entailment
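For reference, a single example under that schema would look roughly like this (a sketch only; field names assumed from the linked task_schemas.md):

```python
# Rough sketch of one example in the BigBio textual entailment schema.
# Field names (premise / hypothesis / label) assumed from task_schemas.md.
entailment_example = {
    "id": "0",
    "premise": "the claim being verified",
    "hypothesis": "an evidence sentence from the cited abstract",
    "label": "entailment",  # free-form string; ideally maps onto
                            # entailment / contradiction / neutral
}
```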

nbroad1881 commented 2 years ago

In my opinion, the schema is essentially the same whether it's entailment or text pairs. I'll make the task and schema textual entailment as you recommend. Thank you.

galtay commented 2 years ago

Yes ... at one point we required the label in the entailment task to be one of [entailment, contradiction, neutral], but we ended up with the more general "string" version, which makes the schema basically the same as the text pairs schema. My hope is that when people do multi-task learning on this bigbio corpus, they will be able to take advantage of the fact that entailment datasets carry labels that can be mapped onto the [entailment, contradiction, neutral] triple (which will not be true for datasets implemented in the text pairs schema). That seems to hold for approach 2 in your previous message. Taking a quick peek at the data, I see things like:

{"id": 436, "claim": "Free histones are degraded by a Rad53-dependent mechanism once DNA has been replicated.", "evidence": {"14637235": [{"sentences": [1], "label": "SUPPORT"}, {"sentences": [2], "label": "SUPPORT"}]}, "cited_doc_ids": [14637235]}

which I think lends itself to two entailment samples, with the claim mapped to premise and each evidence element mapped to a hypothesis. If you run into a situation where the labels are not in this triplet form, please reach out and we can figure out what to do. Thanks for contributing!
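That mapping can be sketched as follows. This is not the actual loader, just an illustration of the claim-to-premise / evidence-to-hypothesis idea; the `get_sentence` lookup (resolving a sentence index within a cited abstract to its text) and the `LABEL_MAP` are assumptions:

```python
# Sketch: turn one SciFact claim record into entailment-style examples.
# `get_sentence(doc_id, idx)` is a hypothetical lookup that returns the
# idx-th sentence of the cited abstract; LABEL_MAP is an assumed mapping
# from SciFact labels onto entailment-style labels.

record = {
    "id": 436,
    "claim": "Free histones are degraded by a Rad53-dependent mechanism once DNA has been replicated.",
    "evidence": {
        "14637235": [
            {"sentences": [1], "label": "SUPPORT"},
            {"sentences": [2], "label": "SUPPORT"},
        ]
    },
    "cited_doc_ids": [14637235],
}

LABEL_MAP = {"SUPPORT": "entailment", "CONTRADICT": "contradiction"}

def to_entailment(record, get_sentence):
    """Yield one entailment example per evidence sentence."""
    examples = []
    for doc_id, ev_list in record["evidence"].items():
        for i, ev in enumerate(ev_list):
            for sent_idx in ev["sentences"]:
                examples.append({
                    "id": f'{record["id"]}_{doc_id}_{i}_{sent_idx}',
                    "premise": record["claim"],
                    "hypothesis": get_sentence(doc_id, sent_idx),
                    "label": LABEL_MAP.get(ev["label"], ev["label"]),
                })
    return examples
```

With a stub lookup such as `lambda d, i: f"sentence {i} of doc {d}"`, the record above yields two examples, both with the claim as premise and label "entailment".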