To match the split described in the paper, we now filter out unanswerable questions when splitting.
Note that the paper on arXiv may still be out of date and show the old numbers. The latest numbers are in our ICLR submission on OpenReview.net (https://openreview.net/forum?id=ry3iBFqgl): 92,549 samples for training, 5,166 for validation, and 5,126 for testing. Those numbers match the current split produced by this code. We'll update the paper on arXiv soon, or after the review is done. Thanks for pointing this out, @dirkweissenborn.
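As a rough illustration of filtering before splitting, here is a minimal pandas sketch on toy data. The column names (`story_id`, `is_answer_absent`), the 0.5 threshold, and the story-to-split mapping are all assumptions made up for this example, not necessarily what this code does.

```python
import pandas as pd

# Toy data. Column names and values are hypothetical, for illustration only.
df = pd.DataFrame({
    "story_id": ["a", "a", "b", "c", "c", "d"],
    "question": ["q1", "q2", "q3", "q4", "q5", "q6"],
    "is_answer_absent": [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})

# Drop unanswerable questions first, so the split sizes count only
# answerable ones. The 0.5 threshold is an assumption.
answerable = df[df["is_answer_absent"] < 0.5]

# Hypothetical assignment of stories to splits.
split_of_story = {"a": "train", "b": "train", "c": "dev", "d": "test"}

# Group the remaining rows by the split their story belongs to.
split_key = answerable["story_id"].map(split_of_story)
splits = {
    name: part.reset_index(drop=True)
    for name, part in answerable.groupby(split_key)
}
split_sizes = {name: len(part) for name, part in splits.items()}
print(split_sizes)
```

Filtering before counting is what makes the reported sample counts refer to answerable questions only.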
Loading is 2 times faster
Loading and other processing are now much faster: we use itertuples instead of iterrows, and we are more careful about when we update answer_char_ranges during loading.
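For anyone unfamiliar with the difference, here is a small sketch of the two iteration styles in pandas. The DataFrame and its values are made up for the example. Only `iterrows` and `itertuples` are real pandas calls.

```python
import pandas as pd

# Toy frame. The values are invented and do not reflect the real data format.
df = pd.DataFrame({
    "story_id": ["a", "b", "c"],
    "answer_char_ranges": ["0:5", "10:20|30:35", "None"],
})

# iterrows() builds a new Series for every row, which is slow on large frames.
slow = [(row["story_id"], row["answer_char_ranges"]) for _, row in df.iterrows()]

# itertuples() yields lightweight namedtuples; fields are read as attributes.
fast = [(row.story_id, row.answer_char_ranges) for row in df.itertuples(index=False)]

# Both loops visit the same values in the same order.
assert slow == fast
```

Attribute access on the namedtuples only works for column names that are valid Python identifiers; other columns are renamed positionally by `itertuples`.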