Filter out unanswerable questions when splitting

To match the split described in the paper, unanswerable questions are now filtered out when splitting.

Note that the version of the paper on arXiv may still be out of date and show the old numbers. The latest numbers are on OpenReview.net for our ICLR submission (https://openreview.net/forum?id=ry3iBFqgl): 92,549 samples for training, 5,166 for validation, and 5,126 for testing. Those numbers match the current split produced by this code. We'll update the paper on arXiv soon, or after the review is done. Thanks for pointing this out, @dirkweissenborn.
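As a rough illustration of filtering while splitting, here is a minimal sketch. It is not the code from this PR: the `is_answerable` flag, the `story_to_split` mapping, and the `split_answerable` helper are all hypothetical names chosen for the example, and the real dataset may encode answerability differently.

```python
from collections import defaultdict


def split_answerable(questions, story_to_split):
    """Group answerable questions by the split assigned to their story."""
    splits = defaultdict(list)
    for q in questions:
        # Hypothetical flag: skip questions that have no answer in the story.
        if not q["is_answerable"]:
            continue
        splits[story_to_split[q["story_id"]]].append(q)
    return dict(splits)


# Toy data standing in for the real question records.
questions = [
    {"story_id": "s1", "question": "Who won?", "is_answerable": True},
    {"story_id": "s1", "question": "What colour was it?", "is_answerable": False},
    {"story_id": "s2", "question": "Where was it held?", "is_answerable": True},
]
story_to_split = {"s1": "train", "s2": "test"}

splits = split_answerable(questions, story_to_split)
print({name: len(items) for name, items in splits.items()})
```

Filtering inside the splitting step means the per-split counts already exclude unanswerable questions, so they can be compared directly with the published numbers.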

Loading is 2 times faster

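Loading and other processing is now much faster because we use itertuples instead of iterrows and are more careful about when we update answer_char_ranges during loading. In the logs below, the "Setting story texts" step drops from about 15 s to about 2 s (7.65K to 49.4K questions/s), and the total load time drops from about 31 s to about 17 s.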
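The difference between the two iteration styles can be sketched as follows. This is a toy example, not the code from this PR: the column names and the `question_length` column are made up for illustration, and assigning the collected values once at the end is only one plausible reading of "being more careful" about updates.

```python
import pandas as pd

# Toy frame; the column names here are illustrative only.
df = pd.DataFrame({
    "story_id": ["s1", "s1", "s2"],
    "question": ["Who won?", "When?", "Where was it held?"],
})

# Slower: iterrows() builds a pandas Series for every row.
lengths_rows = [len(row["question"]) for _, row in df.iterrows()]

# Faster: itertuples() yields lightweight namedtuples instead.
lengths_tuples = [len(row.question) for row in df.itertuples(index=False)]

# Collect results in a plain list and assign the column once,
# rather than writing into the DataFrame on every iteration.
df["question_length"] = lengths_tuples

print(lengths_rows == lengths_tuples)
```

Both loops compute the same values; the speedup comes from avoiding per-row Series construction and per-row writes into the DataFrame.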

Before

[INFO] 2016-12-22 13:17:52,071 - data_processing.py::__init__
Loading dataset from `c:\Users\Justin\workspace\newsqa\maluuba\newsqa\newsqa-data-v1.csv`...
[INFO] 2016-12-22 13:17:52,615 - data_processing.py::__init__
Loading stories from `c:\Users\Justin\workspace\newsqa\maluuba\newsqa\cnn_stories.tgz`...
Getting story texts: 100%|##############################################################################| 12.7K/12.7K [00:14<00:00, 885 stories/s]
Setting story texts: 100%|############################################################################| 120K/120K [00:15<00:00, 7.65K questions/s]
[INFO] 2016-12-22 13:18:22,655 - data_processing.py::__init__
Done loading dataset.

Now

[INFO] 2016-12-22 13:17:14,960 - data_processing.py::__init__
Loading dataset from `c:\Users\Justin\workspace\newsqa\maluuba\newsqa\newsqa-data-v1.csv`...
[INFO] 2016-12-22 13:17:15,365 - data_processing.py::__init__
Loading stories from `c:\Users\Justin\workspace\newsqa\maluuba\newsqa\cnn_stories.tgz`...
Getting story texts: 100%|##############################################################################| 12.7K/12.7K [00:14<00:00, 884 stories/s]
Setting story texts: 100%|############################################################################| 120K/120K [00:02<00:00, 49.4K questions/s]
[INFO] 2016-12-22 13:17:32,217 - data_processing.py::__init__
Done loading dataset.

Maluuba / newsqa

Filter out unanswerable questions when splitting #7
