SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for XQuAD-R #593

Closed SamuelCahyawijaya closed 1 month ago

SamuelCahyawijaya commented 3 months ago

Dataloader name: xquadr/xquadr.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?xquadr

Dataset: xquadr
Description: XQuAD-R is a retrieval version of the XQuAD dataset (a cross-lingual extractive QA dataset) that is a part of the LAReQA benchmark. Like XQuAD, XQuAD-R is an 11-way parallel dataset, where each question (out of around 1200) appears in 11 different languages and has 11 parallel correct answers across the languages. It is designed to include parallel QA pairs across languages, allowing questions to be matched with answers from different languages. The span-tagging task in XQuAD is converted into a retrieval task by breaking up each contextual paragraph into sentences, and treating each sentence as a possible target answer. There are around 1000 candidate answers in each language.
Subsets: -
Languages: tha, vie
Tasks: Text Retrieval
License: Creative Commons Attribution Share Alike 4.0 (cc-by-sa-4.0)
Homepage: https://github.com/google-research-datasets/lareqa
HF URL: -
Paper URL: https://aclanthology.org/2020.emnlp-main.477.pdf

akhdanfadh commented 3 months ago

self-assign

akhdanfadh commented 3 months ago

From the homepage:

Note that files contained in this repository for XQuAD-R are simply the original XQuAD data annotated with sentence boundaries for each of the paragraphs, added as an additional field in the jsons.

The additional fields look like this (download English subset here):

"sentence_breaks": [
  [
      0,    -> sentence's str start index
      165   -> sentence's str end index
  ],
  ...
],
"sentences": [
  "The Panthers defense gave up just 308 points, ranking sixth in the league, while also leading the NFL in interceptions with 24 and boasting four Pro Bowl selections.",  
  ├─> 1st sentence_break corresponds to 1st sentence
  ...
]
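To make the relationship between the two fields concrete, here is a minimal Python sketch. It assumes each sentence_breaks entry is a [start, end) pair with an exclusive end index, so that slicing the context recovers the matching sentence. The toy context and the helper name sentences_from_breaks are made up for illustration and are not part of the dataset or of any existing loader.

```python
def sentences_from_breaks(context, sentence_breaks):
    """Recover sentences by slicing the context with [start, end) pairs.

    Assumption: the end index is exclusive, like a Python slice.
    """
    return [context[start:end] for start, end in sentence_breaks]


# Toy paragraph (hypothetical, not taken from XQuAD-R).
toy_context = "Cats purr. Dogs bark."
toy_breaks = [[0, 10], [11, 21]]

toy_sentences = sentences_from_breaks(toy_context, toy_breaks)
print(toy_sentences)
```

If the assumption holds for the real files, the i-th slice should equal the i-th entry of "sentences", which is a cheap sanity check to run while writing the dataloader.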

Given that the XQuAD dataloader is already implemented and this one focuses on retrieval, the task should only be Text Retrieval, and the new dataloader should focus on those additional fields, IMO.

How should I approach the mapping to the pairs schema? I am also quite confused about why the text retrieval task uses that schema...

@holylovenia @SamuelCahyawijaya @sabilmakbar

holylovenia commented 3 months ago

Given that the XQuAD dataloader is already implemented and this one focuses on retrieval, the task should only be Text Retrieval, and the new dataloader should focus on those additional fields, IMO.

I agree with you, @akhdanfadh. I've updated the issue ticket and the datasheet accordingly.

How should I approach the mapping to the pairs schema? I am also quite confused about why the text retrieval task uses that schema...

That is because existing text retrieval datasets commonly consist of a pair of texts and a label indicating whether the pair is positive or negative. However, XQuAD-R is a QA retrieval task, which is a bit different.

Based on Section 3.1 of the paper:

Specifically, we break each contextual paragraph into sentences, and include all sentences across the dataset as candidate answers. A sentence is considered a correct answer to a question if it contains the target answer span for either that question or an equivalent question in another language (as identified by qas id).

And a data instance example of the dataset:

{
      "context": "The Broncos defeated the Pittsburgh Steelers in the divisional round, 23–16, by scoring 11 points in the final three minutes of the game. They then beat the defending Super Bowl XLIX champion New England Patriots in the AFC Championship Game, 20–18, by intercepting a pass on New England's 2-point conversion attempt with 17 seconds left on the clock. Despite Manning's problems with interceptions during the season, he didn't throw any in their two playoff games.",
      "qas": [
        {
          "answers": [
            {
              "answer_start": 25,
              "text": "Pittsburgh Steelers"
            }
          ],
          "id": "56beb7953aeaaa14008c92ab",
          "question": "Who lost to the Broncos in the divisional round?"
        },
        {
          "answers": [
            {
              "answer_start": 88,
              "text": "11"
            }
          ],
          "id": "56beb7953aeaaa14008c92ac",
          "question": "How many points did the Broncos score in the last three minutes of the game versus Pittsburgh?"
        },
        {
          "answers": [
            {
              "answer_start": 192,
              "text": "New England Patriots"
            }
          ],
          "id": "56beb7953aeaaa14008c92ad",
          "question": "Who won Super Bowl XLIX?"
        },
        {
          "answers": [
            {
              "answer_start": 243,
              "text": "20–18"
            }
          ],
          "id": "56beb7953aeaaa14008c92ae",
          "question": "What was the final score of the AFC Championship Game?"
        },
        {
          "answers": [
            {
              "answer_start": 322,
              "text": "17 seconds"
            }
          ],
          "id": "56beb7953aeaaa14008c92af",
          "question": "How much time remained on the clock when the Broncos made the interception that clinched the AFC Championship Game?"
        },
        {
          "answers": [
            {
              "answer_start": 4,
              "text": "Broncos"
            }
          ],
          "id": "56bf36b93aeaaa14008c9561",
          "question": "What team was the divisional round winner between the Broncos and Steelers?"
        },
        {
          "answers": [
            {
              "answer_start": 70,
              "text": "23–16"
            }
          ],
          "id": "56bf36b93aeaaa14008c9562",
          "question": "What was the final score of the game between the Broncos and Steelers?"
        },
        {
          "answers": [
            {
              "answer_start": 192,
              "text": "New England Patriots"
            }
          ],
          "id": "56bf36b93aeaaa14008c9563",
          "question": "Who won Super Bowl XLIX?"
        },
        {
          "answers": [
            {
              "answer_start": 322,
              "text": "17"
            }
          ],
          "id": "56bf36b93aeaaa14008c9564",
          "question": "How many seconds were left in the game when the Broncos intercepted the pass that won the game?"
        },
        {
          "answers": [
            {
              "answer_start": 360,
              "text": "Manning"
            }
          ],
          "id": "56bf36b93aeaaa14008c9565",
          "question": "During the Bronco's playoff games, who did not throw at all?"
        },
        {
          "answers": [
            {
              "answer_start": 25,
              "text": "Pittsburgh Steelers"
            }
          ],
          "id": "56d7018a0d65d214001982c2",
          "question": "Who did the Broncos beat in the divisional game?"
        },
        {
          "answers": [
            {
              "answer_start": 88,
              "text": "11"
            }
          ],
          "id": "56d7018a0d65d214001982c3",
          "question": "How many points did the Broncos score in the final three minutes of the Pittsburgh game?"
        },
        {
          "answers": [
            {
              "answer_start": 192,
              "text": "New England Patriots"
            }
          ],
          "id": "56d7018a0d65d214001982c5",
          "question": "Who did the Broncos defeat in the AFC Championship game?"
        },
        {
          "answers": [
            {
              "answer_start": 25,
              "text": "Pittsburgh Steelers"
            }
          ],
          "id": "56d99f99dc89441400fdb628",
          "question": "Who did the Broncos beat to win their division in 2015?"
        },
        {
          "answers": [
            {
              "answer_start": 192,
              "text": "New England Patriots"
            }
          ],
          "id": "56d99f99dc89441400fdb629",
          "question": "Who did the Broncos beat tp become the AFC champions?"
        },
        {
          "answers": [
            {
              "answer_start": 322,
              "text": "17"
            }
          ],
          "id": "56d99f99dc89441400fdb62c",
          "question": "How many seconds were left in the game when the Patriots failed their 2-point conversion?"
        }
      ],
      "sentence_breaks": [
        [
          0,
          137
        ],
        [
          138,
          351
        ],
        [
          352,
          464
        ]
      ],
      "sentences": [
        "The Broncos defeated the Pittsburgh Steelers in the divisional round, 23–16, by scoring 11 points in the final three minutes of the game.",
        "They then beat the defending Super Bowl XLIX champion New England Patriots in the AFC Championship Game, 20–18, by intercepting a pass on New England's 2-point conversion attempt with 17 seconds left on the clock.",
        "Despite Manning's problems with interceptions during the season, he didn't throw any in their two playoff games."
      ]
    }
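Using the instance above, the rule quoted from Section 3.1 (a sentence is correct if it contains the target answer span) can be sketched in Python as follows. This is only an illustration: the helper name find_answer_sentence is hypothetical, and it assumes that sentence_breaks holds [start, end) pairs with an exclusive end. The break values and answer offsets are copied from the instance above.

```python
def find_answer_sentence(answer_start, answer_text, sentence_breaks):
    """Return the index of the sentence that fully contains the answer span.

    Assumption: each break is a [start, end) pair (end exclusive).
    Returns None if no single sentence covers the whole span.
    """
    answer_end = answer_start + len(answer_text)
    for idx, (start, end) in enumerate(sentence_breaks):
        if start <= answer_start and answer_end <= end:
            return idx
    return None


# Values copied from the data instance above.
sentence_breaks = [[0, 137], [138, 351], [352, 464]]

steelers_idx = find_answer_sentence(25, "Pittsburgh Steelers", sentence_breaks)
patriots_idx = find_answer_sentence(192, "New England Patriots", sentence_breaks)
manning_idx = find_answer_sentence(360, "Manning", sentence_breaks)
print(steelers_idx, patriots_idx, manning_idx)
```

Under these assumptions, the three spans fall into the first, second, and third sentences respectively.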

Because of this, in XQuAD-R's dataloader, it seems that for each data instance:

In this case, it looks like the qa schema would be more suitable?

CMIIW.

akhdanfadh commented 3 months ago

@holylovenia

answer is taken from one of the sentences

In that case, instead of using text for the answer, we use the sentence that contains it, because this is about retrieval, right? That will be the main difference from the xquad dataloader.

Aight, implementing this now by adding Tasks.QUESTION_ANSWERING_RETRIEVAL.

sabilmakbar commented 3 months ago

I think the answer field should be the actual answer provided in the text field, but the corresponding sentences that contain the answer can be put in the meta, along with all candidates, for the SEACrowd Schema implementation. wdyt?

akhdanfadh commented 3 months ago

@sabilmakbar I think that would go against the main idea of the dataset, which is more about the retrieval part. The dataset is also meant as a retrieval benchmark rather than a source of ground-truth answer text. Quoting the GitHub homepage:

As part of the LAReQA benchmark, <...> We release XQuAD with sentence breaks in this repository for use as XQuAD-R. <...> Note that files contained in this repository for XQuAD-R are simply the original XQuAD data annotated with sentence boundaries for each of the paragraphs, added as an additional field in the jsons.

@holylovenia also quoted the paper. Based on this, we can think of the dataset as a kind of "multiple choice" task where the choices are all the sentences, though strictly speaking it is not one.

Specifically, we break each contextual paragraph into sentences, and include all sentences across the dataset as candidate answers. A sentence is considered a correct answer to a question if it contains the target answer span for either that question or an equivalent question in another language (as identified by qas id).
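The cross-lingual part of that rule (matching by qas id across languages) might look like the Python sketch below. Everything in it is hypothetical: the candidate records, the placeholder ids "q1" to "q3", and the field names lang, sentence_idx, and answers_qas_ids are invented for illustration and do not come from the dataset files.

```python
from collections import defaultdict


def build_relevance_index(candidates):
    """Map each qas id to every candidate sentence, in any language, that answers it."""
    index = defaultdict(list)
    for cand in candidates:
        for qas_id in cand["answers_qas_ids"]:
            index[qas_id].append((cand["lang"], cand["sentence_idx"]))
    return dict(index)


# Hypothetical candidate pool: parallel sentences share the same qas ids.
candidates = [
    {"lang": "tha", "sentence_idx": 0, "answers_qas_ids": {"q1", "q2"}},
    {"lang": "vie", "sentence_idx": 0, "answers_qas_ids": {"q1", "q2"}},
    {"lang": "tha", "sentence_idx": 1, "answers_qas_ids": {"q3"}},
    {"lang": "vie", "sentence_idx": 1, "answers_qas_ids": {"q3"}},
]

relevance = build_relevance_index(candidates)
print(relevance["q1"])
```

With this kind of index, a question asked in one language has correct candidates in every language that shares its qas id, which is the property the quoted passage describes.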

Overall, we can still put the ground truth answer text in meta fields, though. So, no important data is being neglected here.
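A rough Python sketch of that idea follows: the retrieved sentence goes in the answer field and the original span is kept in meta. The record keys (question_id, choices, answer, meta, and so on) are assumptions made for illustration and are not confirmed here to match the actual SEACrowd qa schema. The toy instance is invented, and the end index of each sentence break is assumed to be exclusive.

```python
def instance_to_records(instance):
    """Flatten one paragraph-level instance into one record per question.

    Assumptions: sentence_breaks are [start, end) pairs, and only the
    first answer of each question is used.
    """
    records = []
    pairs = list(zip(instance["sentence_breaks"], instance["sentences"]))
    for qa in instance["qas"]:
        span = qa["answers"][0]
        span_end = span["answer_start"] + len(span["text"])
        # Pick the sentence whose character range covers the answer span.
        target = next(
            (sent for (start, end), sent in pairs
             if start <= span["answer_start"] and span_end <= end),
            None,
        )
        records.append({
            "question_id": qa["id"],
            "question": qa["question"],
            "context": instance["context"],
            "choices": list(instance["sentences"]),  # all candidate sentences
            "answer": [target] if target is not None else [],
            "meta": {
                "answer_text": span["text"],
                "answer_start": span["answer_start"],
            },
        })
    return records


# Toy instance (hypothetical, mirrors the JSON layout shown in this thread).
toy_instance = {
    "context": "Cats purr. Dogs bark.",
    "sentences": ["Cats purr.", "Dogs bark."],
    "sentence_breaks": [[0, 10], [11, 21]],
    "qas": [
        {"id": "q1", "question": "What do dogs do?",
         "answers": [{"answer_start": 16, "text": "bark"}]},
    ],
}

toy_records = instance_to_records(toy_instance)
print(toy_records[0]["answer"], toy_records[0]["meta"])
```

Keeping the span text and offset in meta means the original XQuAD annotation stays recoverable even though the answer field holds the full sentence.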