allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Disks45 cannot read docs from plain text files. #188

Closed ArthurCamara closed 2 years ago

ArthurCamara commented 2 years ago

Describe the bug When iterating over documents in Robust04 (disks45/nocr/trec-robust-2004), ir_datasets fails to read documents that are stored as plain-text files rather than compressed files.

Seems like an easy fix on this line: https://github.com/allenai/ir_datasets/blob/27317b2951a2c7f843ffc7c8d0b245acdc784c7f/ir_datasets/formats/trec.py#L139

The problem is that the variable path is a str rather than a Path object, so calling path.open(...) on it raises an AttributeError.

Either change the line to

with open(path, 'rb') as f:

or

with Path(path).open('rb') as f:
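To illustrate the second option, here is a minimal sketch of a helper that accepts either a str or a Path. The name open_binary is hypothetical and is not part of ir_datasets; it only shows that wrapping the value in Path(...) makes .open('rb') safe for both types.

```python
from pathlib import Path


def open_binary(path):
    """Open `path` for binary reading, whether it is a str or a Path.

    Hypothetical helper for illustration only; not an ir_datasets API.
    """
    # Path(path) is a no-op conversion for Path objects and wraps plain
    # strings, so .open() is available in both cases.
    return Path(path).open('rb')
```

With this, open_binary('corpus/FBIS/FB396001') and open_binary(Path('corpus/FBIS/FB396001')) behave the same, whereas calling .open('rb') directly on the str raises the AttributeError reported below.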

Affected dataset(s) disks45/nocr

To Reproduce Steps to reproduce the behavior: with plain (uncompressed) files in ~/.ir_datasets/disks45/corpus/ (e.g. ~/.ir_datasets/disks45/corpus/FBIS/FB396001 is uncompressed), run:

import ir_datasets

dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
# Iterating fails as soon as an uncompressed corpus file is reached.
for doc in dataset.docs_iter():
    print(doc)

yields:

AttributeError: 'str' object has no attribute 'open'

ArthurCamara commented 2 years ago

Less important, and possibly specific to my copy of the dataset: file and folder names can be lowercase (e.g. disks45/corpus/fbis/fb496247), in which case the glob would not find them.
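One possible way to handle the lowercase case is to match paths case-insensitively instead of relying on a case-sensitive glob. The sketch below is an assumption about how this could be done, not how ir_datasets does it; the function name find_case_insensitive and the pattern 'FBIS/FB*' are hypothetical.

```python
import fnmatch
from pathlib import Path


def find_case_insensitive(root, pattern):
    """Yield files under `root` whose relative path matches `pattern`,
    ignoring case. Hypothetical helper, not an ir_datasets API.
    """
    root = Path(root)
    lowered_pattern = pattern.lower()
    for candidate in sorted(root.rglob('*')):
        if not candidate.is_file():
            continue
        # Compare lowercased relative paths with forward slashes.
        relative = candidate.relative_to(root).as_posix().lower()
        # Unlike glob, fnmatch lets '*' match '/' as well, so a pattern
        # such as 'FBIS/FB*' also matches files in nested subfolders.
        if fnmatch.fnmatchcase(relative, lowered_pattern):
            yield candidate
```

For example, list(find_case_insensitive(corpus_dir, 'FBIS/FB*')) would pick up both FBIS/FB396001 and fbis/fb496247, at the cost of walking the whole directory tree.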