File structure stated in msmarco_passage.py is not aligned with downloaded top1000.dev.tar.gz

allenai / ir_datasets

Provides a common interface to many IR ranking datasets.

Apache License 2.0


Describe the bug

In msmarco_passage.py, lines 199-204, the dev/small dataset is defined as:

subsets['dev/small'] = Dataset(
        collection,
        TsvQueries(Cache(TarExtract(dlc['collectionandqueries'], 'queries.dev.small.tsv'), base_path/'dev/small/queries.tsv'), namespace='msmarco', lang='en'),
        TrecQrels(Cache(TarExtract(dlc['collectionandqueries'], 'qrels.dev.small.tsv'), base_path/'dev/small/qrels'), QRELS_DEFS),
        TrecScoredDocs(Cache(ExtractQidPid(TarExtract(dlc['dev/scoreddocs'], 'top1000.dev')), base_path/'dev/ms.run')),
    )

I looked at the structure of collectionandqueries.tar.gz, and it matches what is stated above. However, the structure of top1000.dev.tar.gz is different:

top1000.dev.tar.gz
|-- top1000.dev.tar
     |-- top1000.dev

The downloaded tar.gz file contains no dev/scoreddocs entry; instead, top1000.dev sits inside top1000.dev.tar.
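One possible reading of the tree above, offered as an assumption rather than a confirmed diagnosis: a .tar.gz file is a gzip stream wrapped around a single tar archive, and viewers such as 7-Zip show those two layers as two levels. Under that reading, top1000.dev.tar is simply the gzip layer's payload, not an extra nested archive. The sketch below builds a small synthetic .tar.gz in memory (the member name mirrors the issue, the payload is a placeholder) and lists its members both ways using only the standard library.

```python
import gzip
import io
import tarfile
from typing import List


def make_tar_gz(member_name: str, payload: bytes) -> bytes:
    """Build a small .tar.gz in memory (synthetic stand-in, not the real download)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_members(tar_gz_bytes: bytes) -> List[str]:
    """List members while tarfile handles the gzip layer transparently."""
    with tarfile.open(fileobj=io.BytesIO(tar_gz_bytes), mode="r:gz") as tar:
        return tar.getnames()


def list_members_two_step(tar_gz_bytes: bytes) -> List[str]:
    """Peel off the gzip layer first (the two-level view), then read the plain tar."""
    plain_tar = gzip.decompress(tar_gz_bytes)
    with tarfile.open(fileobj=io.BytesIO(plain_tar), mode="r:") as tar:
        return tar.getnames()


archive = make_tar_gz("top1000.dev", b"placeholder\n")
one_step = list_members(archive)
two_step = list_members_two_step(archive)
print(one_step, two_step)
```

Both calls report the same single member, which suggests the two-level view and a direct tar read describe the same file. Whether dev/scoreddocs is a path inside the archive or just a lookup key in the library's download configuration is not settled by this sketch.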

Affected dataset(s)

msmarco-passage

To Reproduce

Steps to reproduce the behavior:

1. Download top1000.dev.tar.gz from here.
2. Open the file in 7-Zip or a similar tool.

Expected behavior

Following what is stated in msmarco_passage.py, line 203, I would expect one of the following structures:

top1000.dev.tar.gz
|-- dev
     |-- scoreddocs.tar
          |-- top1000.dev

top1000.dev.tar.gz
|-- dev/scoreddocs.tar
     |-- top1000.dev

Additional context

I would also appreciate a symlink tutorial for Windows users. Judging by the bugs I ran into, I guess this library is primarily written on (and for) Linux.

$ ir_datasets export msmarco-passage/dev/small scoreddocs
...
[INFO] If you have a local copy of https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz, you can symlink it here to avoid downloading it again: /home/sean/.ir_datasets/downloads/8c140662bdf123a98fbfe3bb174c5831
...
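For Windows users, where creating symlinks typically requires Developer Mode or elevated privileges, the symlink step suggested by the [INFO] message could be sketched in Python roughly as follows. The helper name link_or_copy is hypothetical and not part of ir_datasets, and the demo uses temporary placeholder paths rather than the real cache directory. It falls back to a plain copy when a symlink cannot be created.

```python
import os
import shutil
import tempfile
from pathlib import Path


def link_or_copy(local_copy: Path, cache_path: Path) -> str:
    """Expose local_copy at cache_path via a symlink, copying if symlinks fail.

    Hypothetical helper for illustration; not an ir_datasets API.
    """
    if cache_path.exists() or cache_path.is_symlink():
        return "exists"  # leave any existing entry untouched
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.symlink(local_copy, cache_path)
        return "symlink"
    except (OSError, NotImplementedError):
        # For example, Windows without Developer Mode or admin rights.
        shutil.copy2(local_copy, cache_path)
        return "copy"


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "top1000.dev.tar.gz"
    src.write_bytes(b"placeholder")
    dst = Path(tmp) / "downloads" / "8c140662bdf123a98fbfe3bb174c5831"
    how = link_or_copy(src, dst)
    same = dst.read_bytes() == src.read_bytes()
    again = link_or_copy(src, dst)
    print(how, same, again)
```

In real use, the second argument would be the path printed in the [INFO] message on your own machine, which may differ from the one shown here.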

$ curl https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz > /home/sean/.ir_datasets/downloads/8c140662bdf123a98fbfe3bb174c5831
$ ir_datasets export msmarco-passage/dev/small scoreddocs
...
188714 Q0 1000052 0 0.0 run
1082792 Q0 1000084 0 0.0 run
995526 Q0 1000094 0 0.0 run
199776 Q0 1000115 0 0.0 run
660957 Q0 1000115 0 0.0 run
...
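The exported lines appear to follow the usual six-column TREC run layout (query ID, the literal Q0, document ID, rank, score, run tag). Assuming that interpretation holds, a minimal parser for this output might look like the sketch below. The RunLine type and parse_run_line function are illustrative names, not part of ir_datasets, and the sample rows are copied from the output above.

```python
from typing import NamedTuple


class RunLine(NamedTuple):
    query_id: str
    doc_id: str
    rank: int
    score: float
    run_tag: str


def parse_run_line(line: str) -> RunLine:
    """Split one whitespace-separated run line into typed fields."""
    query_id, _q0, doc_id, rank, score, run_tag = line.split()
    return RunLine(query_id, doc_id, int(rank), float(score), run_tag)


sample = """188714 Q0 1000052 0 0.0 run
1082792 Q0 1000084 0 0.0 run
995526 Q0 1000094 0 0.0 run"""

rows = [parse_run_line(line) for line in sample.splitlines()]
for row in rows:
    print(row.query_id, row.doc_id, row.rank, row.score)
```

IDs are kept as strings so that they are not reformatted or compared numerically by accident.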


File structure stated in msmarco_passage.py is not aligned with downloaded top1000.dev.tar.gz #209