allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Update tsv.py #263

Closed tonellotto closed 2 months ago

tonellotto commented 2 months ago

The following code doesn't run correctly if the docstore is not already cached in the pklz4 folder:

import ir_datasets
import more_itertools

dataset = ir_datasets.load('msmarco-passage')

for batch in more_itertools.chunked(dataset.docs_iter(), 8196*4):
    print(len(batch))

With this single-line change, it runs without raising exceptions.
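
To check the batching pattern on its own, independently of ir_datasets and the docstore cache, here is a minimal stdlib-only sketch of the same chunking. The helper `chunked_stdlib` and the synthetic `range` input are hypothetical stand-ins for `more_itertools.chunked` and `dataset.docs_iter()`; they are not part of ir_datasets or of this PR.

```python
from itertools import islice


def chunked_stdlib(iterable, n):
    """Yield successive lists of at most n items from iterable.

    Hypothetical stand-in for more_itertools.chunked, for illustration only.
    """
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch


batch_size = 8196 * 4  # same batch size as in the snippet above

# Synthetic iterable used in place of dataset.docs_iter() (assumption).
sizes = [len(batch) for batch in chunked_stdlib(range(100_000), batch_size)]
print(sizes)
```

If this runs cleanly but the original snippet does not, the problem is more likely in the document iterator than in the batching itself.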