allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

IO error related to the dataset compressed with .Z #244

Closed yzong12138 closed 5 months ago

yzong12138 commented 11 months ago

With the nocr/trec-robust-2004 dataset, documents cannot be read from the files successfully.

Affected dataset(s) nocr/trec-robust-2004

To Reproduce Steps to reproduce the behavior:

  1. First, obtain access to Disks 4 and 5 through the individual agreement.
  2. Copy FBIS, FR94, etc., which contain data compressed in the .Z and .0Z formats, to the path ~/.ir_datasets/disks45/corpus/NEWS_data
  3. Run the following code:
    import ir_datasets
    robust_dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
    print([query for query, _ in zip(robust_dataset.docs_iter(), range(2))])
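Before running the snippet above, it may help to confirm that the compressed files actually ended up where step 2 puts them. The following is only a minimal sketch: `find_compressed_files` is a hypothetical helper (not part of ir_datasets), and the suffix set covers just the two suffixes mentioned in step 2, so other variants on the disks would be missed.

```python
from pathlib import Path
from typing import List

# Suffixes mentioned in step 2; this set is an assumption and may be incomplete.
COMPRESSED_SUFFIXES = {".Z", ".0Z"}


def find_compressed_files(root: Path) -> List[Path]:
    """Hypothetical helper: list files under root with a known compressed suffix."""
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in COMPRESSED_SUFFIXES
    )


if __name__ == "__main__":
    # Path taken from step 2 of the reproduction steps.
    corpus_dir = Path.home() / ".ir_datasets" / "disks45" / "corpus" / "NEWS_data"
    files = find_compressed_files(corpus_dir)
    print(f"{len(files)} compressed files found under {corpus_dir}")
```

If this reports zero files, the problem is more likely in the copy step than in the decompression call.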

Actual behavior

>>> [query for query, _ in zip(robust_dataset.docs_iter(), range(2))]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/home/zong/.conda/envs/xp/lib/python3.10/site-packages/ir_datasets/util/__init__.py", line 147, in __next__
    return next(self.it)
  File "/home/zong/.conda/envs/xp/lib/python3.10/site-packages/ir_datasets/formats/trec.py", line 127, in docs_iter
    yield from self._docs_iter(path)
  File "/home/zong/.conda/envs/xp/lib/python3.10/site-packages/ir_datasets/formats/trec.py", line 162, in _docs_iter
    with io.BytesIO(unlzw3.unlzw(path)) as f:
  File "/home/zong/.conda/envs/xp/lib/python3.10/site-packages/unlzw3/__init__.py", line 70, in unlzw
    ba_in = bytearray(data)
TypeError: string argument without an encoding
>>> 
yzong12138 commented 11 months ago

The bug appears to come from line 162 of ir_datasets/formats/trec.py. It should change from

with io.BytesIO(unlzw3.unlzw(path)) as f:

to

with io.BytesIO(unlzw3.unlzw(Path(path))) as f:

Previously, the code passed a str to unlzw() to decompress, and bytearray() cannot convert a str without an encoding, hence the TypeError. With the change, the input to unlzw() is a Path, so it loads the data from that path first.
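The root cause can be reproduced with the standard library alone, without unlzw3 or the dataset files. This is a sketch under stated assumptions: `bytearray_error` and `as_path` are hypothetical helpers written for illustration, and "example.Z" is a made-up file name that does not need to exist.

```python
from pathlib import Path
from typing import Union


def bytearray_error(value) -> str:
    """Return the TypeError message raised by bytearray(value), or '' if none."""
    try:
        bytearray(value)
    except TypeError as exc:
        return str(exc)
    return ""


def as_path(path: Union[str, Path]) -> Path:
    """Hypothetical helper: coerce a str or Path to a pathlib.Path."""
    return Path(path)


# A plain str cannot be converted to a bytearray without an encoding;
# this is the error at the bottom of the traceback.
message = bytearray_error("example.Z")
print(message)

# Wrapping the str in Path gives the callee a pathlib.Path object
# instead of a str, which is what the proposed one-line fix does.
wrapped = as_path("example.Z")
print(type(wrapped).__name__, wrapped.suffix)
```

Coercing with `Path(...)` at the call site is harmless when the argument is already a `Path`, so the same line works for both input types.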

seanmacavaney commented 5 months ago

closed with #247