Maluuba / newsqa

Tools for using Maluuba's NewsQA Dataset (public version)
https://www.microsoft.com/en-us/research/project/newsqa-dataset/

Relation between Line Endings and Span Indices #9

Closed nelson-liu closed 7 years ago

nelson-liu commented 7 years ago

Hi, thanks for making this dataset available to the public. I'm trying to do some work with it, but it seems like there are some anomalies in how line endings are encoded relative to the character spans listed as labels.

For example, in storyID ./cnn/stories/42d01e187213e86f5fe617fe32e716ff7fa3afc4.passage (the first example in the train split), with the question "What was the amount of children murdered?", the raw text of the passage is:

NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed ""the house of horrors.""

Moninder Singh Pandher was sentenced to death by a lower court in February.

The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.

Read by a CSV reader, it becomes the following (note the \ns, which appear in the raw CSV as actual line breaks):

NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed "the house of horrors."\n\nMoninder Singh Pandher was sentenced to death by a lower court in February.\n\nThe teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.
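To make the reading step concrete, here is a minimal sketch using Python's standard csv module. The two-column layout and the sample row are hypothetical stand-ins, not the real NewsQA file; the point is only that doubled quotes collapse to a single quote and that line breaks inside a quoted field survive as single \n characters.

```python
import csv
import io

# Hypothetical raw CSV content: one header row and one record whose quoted
# field contains doubled quotes and two literal line breaks.
raw = 'story_id,story_text\nabc,"He said ""hi.""\n\nSecond paragraph."\n'

# newline="" leaves line endings untouched so the csv module handles them.
rows = list(csv.reader(io.StringIO(raw, newline="")))
field = rows[1][1]

print(repr(field))
print(field.count("\n"))  # number of line breaks kept inside the field
```

Each line break inside the field counts as one character when indexing the resulting Python string.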

It seems each \n character was treated as 2 characters rather than 1. The character span of the answer for this passage is listed as 294-297, but if you interpret \n as one character (as many programming languages do, especially since the \n was not escaped when Python wrote to the file to create the dataset), the actual span of the answer is 290-293.
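Here is a small sketch of that hypothesis. The helper to_python_offset and its newline_width parameter are names I made up for illustration, and it assumes the guess above is right (every \n counted as 2 characters when the spans were produced); it is not taken from the dataset tooling.

```python
# The three paragraphs of the passage as they look after CSV parsing.
paragraphs = [
    'NEW DELHI, India (CNN) -- A high court in northern India on Friday '
    'acquitted a wealthy businessman facing the death sentence for the '
    'killing of a teen in a case dubbed "the house of horrors."',
    "Moninder Singh Pandher was sentenced to death by a lower court in February.",
    "The teen was one of 19 victims -- children and young women -- in one of "
    "the most gruesome serial killings in India in recent years.",
]
passage = "\n\n".join(paragraphs)


def to_python_offset(text, dataset_offset, newline_width=2):
    # Hypothetical helper: convert an offset that counts every line feed as
    # `newline_width` characters into an ordinary Python string index.
    width = 0
    for index, char in enumerate(text):
        if width >= dataset_offset:
            return index
        width += newline_width if char == "\n" else 1
    return len(text)


start = to_python_offset(passage, 294)
end = to_python_offset(passage, 297)
print(start, end, repr(passage[start:end]))  # adjusted span
print(repr(passage[294:297]))                # listed span read as-is
```

Under this assumption the listed span 294-297 maps to 290-293, since four \n characters precede the answer.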

In the case of CRLF, I'm not too sure what's happening. If you look at ./cnn/stories/318f71eba1831f330d423043827aa24e565fd329.story, the question is "Who is Radu Mazare?". The story_text here has CRLF line endings (\r\n). The raw text of a relevant excerpt of the passage is:

(CNN) -- Jewish organizations called for a Romanian official to resign and face a criminal investigation after he wore a Nazi uniform during a fashion show over the weekend.\r\n
\r\n
Radu Mazare, the mayor of the town of Constanta, wore a Nazi uniform during a fashion show over the weekend.\r\n....

Again, it seems like \r\n was treated as 4 characters, but I'm not too sure in this case, as taking the span/substring from 196-228 (the actual answer) of:

(CNN) -- Jewish organizations called for a Romanian official to resign and face a criminal investigation after he wore a Nazi uniform during a fashion show over the weekend.\r\n\r\nRadu Mazare, the mayor of the town of Constanta, wore a Nazi uniform during a fashion show over the weekend.\r\n

yields "yor of the town of Constanta, wo", which doesn't seem quite right. How should these be interpreted?
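For reference, this is a short sketch that reproduces the mismatch on the excerpt and prints what the same-length span gives when moved a few characters to the left. The candidate shifts are just values I picked to try; nothing here establishes which one (if any) matches how the spans were generated.

```python
# The excerpt as a Python string, with CRLF line endings kept as "\r\n".
excerpt = (
    "(CNN) -- Jewish organizations called for a Romanian official to resign "
    "and face a criminal investigation after he wore a Nazi uniform during a "
    "fashion show over the weekend.\r\n\r\n"
    "Radu Mazare, the mayor of the town of Constanta, wore a Nazi uniform "
    "during a fashion show over the weekend.\r\n"
)

listed_start, listed_end = 196, 228

# Reading the listed span directly, with "\r\n" counted as two characters.
direct = excerpt[listed_start:listed_end]
print(repr(direct))

# Try moving the span left by a few candidate amounts (arbitrary guesses).
candidates = {
    shift: excerpt[listed_start - shift:listed_end - shift]
    for shift in (0, 2, 4)
}
for shift, text in candidates.items():
    print(shift, repr(text))
```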

So what am I trying to convey? I have a few concrete feature requests:

Thanks!