In #101 (C4 + TREC Health Misinformation 2021), I abstracted many of the annoying bits of writing an iterator over document sources into base classes. This should make adding new large datasets considerably easier, with less boilerplate. I should go back and see which existing document collections could be simplified by using these base classes.
I believe the following datasets could benefit:
- [ ] gov2
- [ ] msmarco-passage-v2
- [ ] tweets2013-ia
- [ ] clueweb09 & clueweb12
- [ ] Maybe even the standard docstore implementation?