Describe the bug
When iterating over documents in Robust04 (disks45/nocr/trec-robust-2004), it fails to return documents that are stored as plain-text files rather than compressed ones.
The problem is that the variable path is a str instead of a Path object, so calling path.open(...) on it fails.
Either change the line to
with open(path, 'rb') as f:
or
with Path(path).open('rb') as f:
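To illustrate the second option, here is a minimal sketch of a helper that accepts either a str or a Path. The name `open_binary` is hypothetical and is not part of ir_datasets; the sample file name only mirrors the example path from this report.

```python
from pathlib import Path
import tempfile


def open_binary(path):
    """Open `path` for binary reading, whether it is a str or a Path.

    Hypothetical helper, not an ir_datasets function: wrapping the value
    in Path() makes .open() available in both cases.
    """
    return Path(path).open('rb')


# Small demonstration with a throwaway file standing in for a corpus file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "FB396001"
    sample.write_bytes(b"<DOC>\n</DOC>\n")

    # A plain str would raise AttributeError on .open(); the helper avoids that.
    for candidate in (str(sample), sample):
        with open_binary(candidate) as f:
            print(type(candidate).__name__, len(f.read()))
```

Wrapping in `Path()` is harmless when the value is already a Path, so the same call works regardless of what the caller passes in.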
Affected dataset(s)
disks45/nocr
To Reproduce
Steps to reproduce the behavior:
With plain (uncompressed) files in ~/.ir_datasets/disks45/corpus/ (e.g. ~/.ir_datasets/disks45/corpus/FBIS/FB396001 is uncompressed), run:
import ir_datasets
dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
for doc in dataset.docs_iter():
    print(doc)
yields:
AttributeError: 'str' object has no attribute 'open'
Less important, and possibly specific to my copy of the dataset: file and folder names can be lowercase (e.g. disks45/corpus/fbis/fb496247), so the glob would not find them.
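If lowercase names need to be supported, one possible approach is to match paths case-insensitively instead of relying on a case-sensitive glob. The sketch below is only an illustration: `iglob_case_insensitive` and the pattern `"FBIS/FB*"` are assumptions made for this example and do not reflect the pattern ir_datasets actually uses.

```python
import fnmatch
from pathlib import Path


def iglob_case_insensitive(root, pattern):
    """Yield files under `root` whose relative path matches `pattern`,
    ignoring case.

    Hypothetical helper, not part of ir_datasets. Note that fnmatch's `*`
    also matches `/`, so a pattern like "FBIS/FB*" matches nested files too.
    """
    root = Path(root)
    lowered = pattern.lower()
    for candidate in sorted(root.rglob('*')):
        if not candidate.is_file():
            continue
        rel = candidate.relative_to(root).as_posix()
        # Compare lowercased strings so FBIS/FB* also matches fbis/fb496247.
        if fnmatch.fnmatchcase(rel.lower(), lowered):
            yield candidate
```

For example, `list(iglob_case_insensitive(corpus_dir, "FBIS/FB*"))` would pick up files under either `FBIS/` or `fbis/`, where `corpus_dir` stands for the local corpus directory.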
The open() failure itself seems like an easy fix on this line: https://github.com/allenai/ir_datasets/blob/27317b2951a2c7f843ffc7c8d0b245acdc784c7f/ir_datasets/formats/trec.py#L139