allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0
309 stars 42 forks

File structure stated in msmarco_passage.py is not aligned with downloaded top1000.dev.tar.gz #209

Open yuenherny opened 1 year ago

yuenherny commented 1 year ago

Describe the bug

In msmarco_passage.py, lines 199-204, the dev/small dataset is defined as:

subsets['dev/small'] = Dataset(
        collection,
        TsvQueries(Cache(TarExtract(dlc['collectionandqueries'], 'queries.dev.small.tsv'), base_path/'dev/small/queries.tsv'), namespace='msmarco', lang='en'),
        TrecQrels(Cache(TarExtract(dlc['collectionandqueries'], 'qrels.dev.small.tsv'), base_path/'dev/small/qrels'), QRELS_DEFS),
        TrecScoredDocs(Cache(ExtractQidPid(TarExtract(dlc['dev/scoreddocs'], 'top1000.dev')), base_path/'dev/ms.run')),
    )

I looked at the structure of collectionandqueries.tar.gz, and it matches what is stated above. However, the structure of top1000.dev.tar.gz is different:

top1000.dev.tar.gz
|-- top1000.dev.tar
     |-- top1000.dev

The downloaded tar.gz file contains no dev/scoreddocs entry; instead, top1000.dev sits inside top1000.dev.tar.

Affected dataset(s)

To Reproduce

Steps to reproduce the behavior:

  1. Download the top1000.dev.tar.gz from here
  2. Open the file in 7-Zip or a similar tool.

Expected behavior

Based on msmarco_passage.py line 203, I would expect the following structure:

top1000.dev.tar.gz
|-- dev
     |-- scoreddocs.tar
          |-- top1000.dev

or

top1000.dev.tar.gz
|-- dev/scoreddocs.tar
     |-- top1000.dev

Additional context

I would also appreciate a symlink tutorial for Windows users. Judging by the bugs I have experienced, I guess this library is primarily written in (and for) Linux.

seanmacavaney commented 1 year ago

Thanks for the report. I'm not able to reproduce it when following the instructions provided by the software. Specifically:

When requesting scoreddocs of msmarco-passage/dev/small, I get the following message as it starts downloading:

$ ir_datasets export msmarco-passage/dev/small scoreddocs
...
[INFO] If you have a local copy of https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz, you can symlink it here to avoid downloading it again: /home/sean/.ir_datasets/downloads/8c140662bdf123a98fbfe3bb174c5831
...

If I stop it there, download the file to the specified location, and re-run, it works without a hitch:

$ curl https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz > /home/sean/.ir_datasets/downloads/8c140662bdf123a98fbfe3bb174c5831
$ ir_datasets export msmarco-passage/dev/small scoreddocs
...
188714 Q0 1000052 0 0.0 run
1082792 Q0 1000084 0 0.0 run
995526 Q0 1000094 0 0.0 run
199776 Q0 1000115 0 0.0 run
660957 Q0 1000115 0 0.0 run
...

(The same would happen if using the Python API, rather than the CLI.)
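For reference, a minimal sketch of the Python API route might look like the following. `ir_datasets.load` and `scoreddocs_iter` are part of the library's interface; the helper names `head_scoreddocs` and `main` are made up for this illustration.

```python
from itertools import islice


def head_scoreddocs(dataset, n=5):
    """Return the first n scored docs as (query_id, doc_id, score) tuples."""
    return [(sd.query_id, sd.doc_id, sd.score)
            for sd in islice(dataset.scoreddocs_iter(), n)]


def main():
    # Imported lazily so the helper above can be used without the package.
    import ir_datasets

    # On first use this triggers the same download step as the CLI.
    dataset = ir_datasets.load("msmarco-passage/dev/small")
    for query_id, doc_id, score in head_scoreddocs(dataset):
        print(query_id, doc_id, score)
```

Calling `main()` should start the same download as the CLI command above the first time it runs, and read from the local copy afterwards.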

It looks like above you're doing more of the extraction yourself, which I generally would not advise. First, it means that the downloads are not verified, so if there was a problem downloading the data, you may inadvertently be working with an incomplete or incorrect set of the data. Second, you may not perform the same pre-processing stages as the software, which can cause problems.
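To illustrate what verifying a download involves, here is a generic checksum sketch. It is not the library's own code: the thread does not say which hash ir_datasets uses, so MD5 is an assumption here, and `file_md5` and `verify_download` are hypothetical helper names. The expected digest would have to come from a trusted source.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Raise ValueError if the file's digest does not match expected_md5."""
    actual = file_md5(path)
    if actual != expected_md5.lower():
        raise ValueError(
            f"checksum mismatch for {path}: got {actual}, expected {expected_md5}"
        )
```

A truncated or corrupted download would fail this check instead of silently producing an incomplete dataset.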

In most cases, I'd suggest just letting the software download the files automatically for you. It really only makes sense to copy/symlink them if you already have a copy and don't want to bother waiting for the download. And when you do this, it's best to follow the instructions given by the software about where to place the files.
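For anyone placing an existing copy by hand, a portable sketch in Python might look like this. `link_or_copy` is a hypothetical helper, not part of ir_datasets. On Windows, `os.symlink` can fail unless Developer Mode is enabled or the account has the privilege to create symlinks, so the sketch falls back to a plain copy.

```python
import os
import shutil


def link_or_copy(src, dst):
    """Symlink src at dst; fall back to copying if symlinks are unavailable."""
    src = os.path.abspath(src)
    dst = os.path.abspath(dst)
    if os.path.lexists(dst):
        raise FileExistsError(f"{dst} already exists")
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    try:
        os.symlink(src, dst)
        return "symlink"
    except (OSError, NotImplementedError):
        # For example: Windows without Developer Mode or the symlink privilege.
        shutil.copy2(src, dst)
        return "copy"
```

The `dst` argument should be the exact path printed in the `[INFO]` message on your own machine, since it will differ from the one shown above.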

In this case, TarExtract transparently performs gzip decompression, in addition to extracting the file. It then performs additional processing via ExtractQidPid to convert the file into a standard file format.
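As a rough illustration of these two steps using only the Python standard library (this is not the actual TarExtract or ExtractQidPid code), the sketch below reads one member straight out of a gzip-compressed tar and keeps only the first two columns. It assumes the member is tab-separated with the query ID and passage ID first. The format ExtractQidPid really writes is not shown in this thread, so the output is shaped like the export lines above purely for illustration; `iter_member_lines` and `to_run_line` are hypothetical names.

```python
import tarfile


def iter_member_lines(archive_path, member_name):
    """Yield text lines of one member from a gzip-compressed tar archive."""
    # Mode "r:gz" undoes the gzip layer and reads the tar layer in one step,
    # so the intermediate .tar that archive tools display never hits the disk.
    with tarfile.open(archive_path, mode="r:gz") as tar:
        member = tar.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name} is not a regular file")
        for raw in member:
            yield raw.decode("utf-8").rstrip("\r\n")


def to_run_line(tsv_line):
    """Convert 'qid<TAB>pid<TAB>...' into a line shaped like the export output."""
    qid, pid = tsv_line.split("\t")[:2]
    # Rank 0, score 0.0 and the tag "run" are placeholders copied from that output.
    return f"{qid} Q0 {pid} 0 0.0 run"
```

With this, `for line in iter_member_lines("top1000.dev.tar.gz", "top1000.dev"): print(to_run_line(line))` would address the member as `top1000.dev` directly, which is why the nested `top1000.dev.tar` layer seen in 7-Zip does not matter to the software.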

> Also appreciate if there is a symlink tutorial for Windows user. Looking at the bugs I experienced, I guess this library is primarily written in (and for) Linux OS.

There's a GitHub Action that runs tests for Windows, but I don't have a Windows machine myself to test things on, and I'm not experienced enough with Windows systems in general to offer advice. I appreciate the reports to help improve Windows support, and would welcome contributions that improve the experience on Windows!