facebookresearch / MLQA

How to get the sentence boundary? #7

Closed wasiahmad closed 4 years ago

wasiahmad commented 4 years ago

In the paper, you mention: "We use whitespace tokenization for all of the MLQA languages other than Chinese". Is there a suggested way to get the sentence boundaries, so that we can use additional sentence-level information?

patrick-s-h-lewis commented 4 years ago

Hi Wasi,

Multilingual sentence segmentation is challenging. I believe we used the Moses sentence splitter during development (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/ems/support/split-sentences.perl), but it may not cover all the languages. The researcher who did this is on leave; we can ask him how it was done when he returns.

Patrick
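
For readers who want sentence boundaries that stay aligned with character-level answer offsets, here is a minimal sketch. It is not the Moses script linked above and not what was used for MLQA. It is a naive regex heuristic that assumes a sentence ends at a terminator followed by whitespace or the end of the text. The terminator set and the function name `split_sentences_with_offsets` are assumptions made for illustration. It will mis-split on abbreviations and does not handle text without whitespace after terminators, such as Chinese.

```python
import re

# Assumption: a sentence ends at one or more of these terminators when they are
# followed by whitespace or the end of the text.
# \u0964 is the Devanagari danda, \u061F is the Arabic question mark.
_SENTENCE_END = re.compile(r"[.!?\u0964\u061F]+(?=\s|$)")


def split_sentences_with_offsets(text):
    """Return (sentence, start, end) triples, where text[start:end] == sentence."""
    sentences = []
    start = 0
    for match in _SENTENCE_END.finditer(text):
        end = match.end()
        chunk = text[start:end]
        if chunk.strip():
            # Skip leading whitespace so the offset points at the first character.
            lead = len(chunk) - len(chunk.lstrip())
            sentences.append((chunk.strip(), start + lead, end))
        start = end
    # Keep any trailing text that has no terminator.
    tail = text[start:]
    if tail.strip():
        lead = len(tail) - len(tail.lstrip())
        stripped = tail.strip()
        sentences.append((stripped, start + lead, start + lead + len(stripped)))
    return sentences


text = "MLQA covers seven languages. Answers are character spans! Is that clear"
for sentence, start, end in split_sentences_with_offsets(text):
    print(start, end, sentence)
```

Because each sentence carries its own character offsets, an answer's `answer_start` can be compared directly against the `(start, end)` ranges to find the sentence that contains it.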

wasiahmad commented 4 years ago

Can you tell me which whitespace tokenizer was used for tokenization? I tried simple whitespace tokenization, but in many cases I was unable to match the token offsets to the ground-truth answer span. The main problem is punctuation symbols.

It would be helpful if you could provide the tokenization script that converts the character offsets into word offsets for the ground-truth answer spans.

patrick-s-h-lewis commented 4 years ago

Hi Wasi,

Answer spans are highlighted by humans, so there is no tokenization involved. Modelling this correctly is part of the challenge of the dataset.

The evaluation script contains the whitespace tokenization that is used to evaluate models.

Patrick
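
Since the answer spans are character-level and can start or end inside a whitespace token, one possible way to relate them to word offsets is to keep each token's character range and select the tokens that overlap the answer. The sketch below assumes plain whitespace tokenization (runs of non-whitespace characters). The helper names are hypothetical and are not taken from the MLQA evaluation script. Note that the resulting token span can be wider than the answer when punctuation is attached to a token.

```python
import re


def whitespace_tokens_with_offsets(text):
    """Return (token, start, end) triples for each run of non-whitespace characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(text, answer_start, answer_text):
    """Map a character-level answer span to inclusive (first, last) token indices.

    A token is selected if its character range overlaps the answer's range.
    Returns None if the answer overlaps no token (for example, pure whitespace).
    """
    answer_end = answer_start + len(answer_text)
    overlapping = [
        i
        for i, (_, start, end) in enumerate(whitespace_tokens_with_offsets(text))
        if start < answer_end and end > answer_start
    ]
    if not overlapping:
        return None
    return overlapping[0], overlapping[-1]


context = "MLQA was released in 2019, covering seven languages."
answer_text = "2019"
answer_start = context.index(answer_text)

span = char_span_to_token_span(context, answer_start, answer_text)
tokens = [tok for tok, _, _ in whitespace_tokens_with_offsets(context)]
print(span, tokens[span[0]:span[1] + 1])
```

In this example the answer "2019" maps to the single token "2019,", which includes the trailing comma. That mismatch is the punctuation problem described above, so a model that predicts token spans still needs a step to recover the exact character span.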