Relation between Line Endings and Span Indices

Hi, thanks for making this dataset available to the public. I'm trying to do some work with it, but there seem to be some anomalies in how line endings are encoded relative to the spans listed as labels.

For example, in storyID ./cnn/stories/42d01e187213e86f5fe617fe32e716ff7fa3afc4.passage (the first example in the train split), for the question "What was the amount of children murdered?", the raw text of the passage is:

NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed ""the house of horrors.""

Moninder Singh Pandher was sentenced to death by a lower court in February.

The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.

Read by a CSV reader, it becomes the following (note the \n characters, which appear in the raw CSV as literal line breaks):

NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed "the house of horrors."\n\nMoninder Singh Pandher was sentenced to death by a lower court in February.\n\nThe teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.

It seems each \n character was counted as 2 characters rather than 1. The character span of the answer for this passage is given as 294-297, but if you interpret \n as a single character (as most programming languages do, especially since the \n was not escaped when Python wrote it to the file to create the dataset), the actual span of the answer is 290-293.
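For reference, here is a minimal sketch of the workaround I'm using for now. It assumes the label offsets were computed on a representation where each \n occupies two characters, which is only my guess from the example above and not something the dataset authors have confirmed. The function name and the newline_width parameter are my own.

```python
def label_offset_to_index(text: str, label_offset: int, newline_width: int = 2) -> int:
    # Map an offset from the label's coordinate system (where each newline
    # is ASSUMED to count as `newline_width` characters) to a Python str index.
    consumed = 0  # how far we have advanced in the label's coordinate system
    for index, char in enumerate(text):
        if consumed >= label_offset:
            return index
        consumed += newline_width if char == "\n" else 1
    return len(text)


# Toy passage: two newlines precede the answer "19".
passage = "ab\n\ncd 19 ef"
start = label_offset_to_index(passage, 9)
end = label_offset_to_index(passage, 11)
answer = passage[start:end]
```

Under this assumption, every newline that precedes the answer moves the real index back by one relative to the label.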

In the case of CRLF, I'm not too sure what's happening. If you look at ./cnn/stories/318f71eba1831f330d423043827aa24e565fd329.story, the question is "Who is Radu Mazare?". The story_text here has CRLF line endings (\r\n). The raw text of a relevant excerpt of the passage is:

(CNN) -- Jewish organizations called for a Romanian official to resign and face a criminal investigation after he wore a Nazi uniform during a fashion show over the weekend.\r\n
\r\n
Radu Mazare, the mayor of the town of Constanta, wore a Nazi uniform during a fashion show over the weekend.\r\n....

Again, it seems like \r\n was treated as 4 characters, but I'm not too sure in this case, because taking the substring at 196-228 (the answer span given in the label) of:

(CNN) -- Jewish organizations called for a Romanian official to resign and face a criminal investigation after he wore a Nazi uniform during a fashion show over the weekend.\r\n\r\nRadu Mazare, the mayor of the town of Constanta, wore a Nazi uniform during a fashion show over the weekend.\r\n

yields "yor of the town of Constanta, wo", which doesn't seem quite right. How should these be interpreted?
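Since I can't tell which convention was used for CRLF, a small diagnostic like the one below might help narrow it down. It is only a sketch: it tries several hypothetical widths for \r and \n (0 meaning the character was dropped before the offsets were computed) and returns the substring each hypothesis would produce, so the results can be inspected by eye. The helper names and the set of widths tried are my own assumptions, not anything taken from the dataset tooling.

```python
from itertools import product


def to_index(text, offset, widths):
    # Convert a label offset to a Python index, given an ASSUMED width
    # (in label coordinates) for each special character.
    consumed = 0
    for index, char in enumerate(text):
        if consumed >= offset:
            return index
        consumed += widths.get(char, 1)
    return len(text)


def candidate_answers(text, start, end, widths_to_try=(0, 1, 2)):
    # Return {(cr_width, lf_width): substring} for every combination tried.
    results = {}
    for cr_width, lf_width in product(widths_to_try, repeat=2):
        widths = {"\r": cr_width, "\n": lf_width}
        lo = to_index(text, start, widths)
        hi = to_index(text, end, widths)
        results[(cr_width, lf_width)] = text[lo:hi]
    return results


# Toy example with CRLF line endings and a made-up label span.
toy = "ab\r\n\r\nRadu Mazare, x"
results = candidate_answers(toy, 10, 14)
```

Running this over a handful of labeled examples and checking which combination consistently yields sensible answers would show how the offsets were actually counted.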

So what am I trying to convey? I have a few concrete feature requests:

It'd be great if you could make the span in the label match the string as it is read programmatically (e.g. story.substring(span_start, span_end) should output the correct answer). There are two ways to do this at this point -- you could either modify all your labels to match the story text as it is now (probably the preferred solution), or unescape the special line endings in the story text (this would be quite messy, though).
It would be great if you could use either CRLF or LF line endings consistently (in my opinion, LF would be better in this case).
Best of all, you could do away with newlines altogether and replace them with spaces. I don't think any models will miss the line breaks (in the sense that they are likely not relevant features), and it would make the NewsQA dataset much easier to work with. Of course, you would have to modify the spans accordingly if you decide to take this route.
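To illustrate the last request, here is a rough sketch of how the newlines could be flattened while keeping a span aligned. It assumes the span is already expressed in ordinary Python indices (one index per character); the function name is hypothetical. Collapsing each \r\n pair to a single character shortens the text, so the span is shifted; replacing a lone \n or \r with a space keeps the length unchanged.

```python
def flatten_newlines(text, span_start, span_end):
    # Replace CRLF, LF and lone CR with a single space each, and shift a
    # (start, end) span of Python indices so it still covers the same text.
    def shift(index):
        # Each "\r\n" pair whose "\r" lies before `index` loses one character.
        return index - text.count("\r\n", 0, index + 1)

    flat = text.replace("\r\n", "\n").replace("\r", " ").replace("\n", " ")
    return flat, shift(span_start), shift(span_end)


story = "one\r\n\r\ntwo 19 three"
flat, new_start, new_end = flatten_newlines(story, 11, 13)
```

Here story[11:13] and flat[new_start:new_end] both select the same answer text.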

Thanks!

Maluuba / newsqa

Relation between Line Endings and Span Indices #9