allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Cannot read LoTTE docs #251

Open ftvalentini opened 10 months ago


Describe the bug: Reading documents from the LoTTE datasets fails with a FileNotFoundError, apparently because of a problem in the download/extraction step.

Affected dataset(s): LoTTE

To Reproduce: Run in Python:

import ir_datasets

dataset = ir_datasets.load("lotte/recreation/test")
for doc in dataset.docs_iter():
    print(doc)
    break

This raises the following error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/home/user/misc/ir-datasets.ipynb Cell 3 line 4
      1 import ir_datasets
      3 dataset = ir_datasets.load("lotte/recreation/test")
----> 4 for doc in dataset.docs_iter():
      5     print(doc)
      6     break

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/ir_datasets/util/__init__.py:147, in DocstoreSplitter.__next__(self)
    146 def __next__(self):
--> 147     return next(self.it)

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/ir_datasets/formats/tsv.py:92, in TsvIter.__next__(self)
     91 def __next__(self):
---> 92     line = next(self.line_iter)
     93     cols = line.rstrip('\n').split('\t')
     94     num_cols = len(self.cls._fields)

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/ir_datasets/formats/tsv.py:28, in FileLineIter.__next__(self)
     26         self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc[self.stream_idx].stream()))
     27     else:
---> 28         self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()))
     29 while self.pos < self.start:
     30     line = self.stream.readline()

File ~/miniconda3/envs/py311/lib/python3.11/contextlib.py:502, in _BaseExitStack.enter_context(self, cm)
    499 except AttributeError:
    500     raise TypeError(f"'{cls.__module__}.{cls.__qualname__}' object does "
    501                     f"not support the context manager protocol") from None
--> 502 result = _enter(cm)
    503 self._push_cm_exit(cm, _exit)
    504 return result

File ~/miniconda3/envs/py311/lib/python3.11/contextlib.py:137, in _GeneratorContextManager.__enter__(self)
    135 del self.args, self.kwds, self.func
    136 try:
--> 137     return next(self.gen)
    138 except StopIteration:
    139     raise RuntimeError("generator didn't yield") from None

File ~/miniconda3/envs/py311/lib/python3.11/site-packages/ir_datasets/util/fileio.py:148, in RelativePath.stream(self)
    146 @contextlib.contextmanager
    147 def stream(self):
--> 148     with open(self.path(), 'rb') as f:
    149         yield f

FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.ir_datasets/lotte/lotte_extracted/lotte/recreation/test/collection.tsv'

Expected behavior: I should see the first doc in the collection, as I do with msmarco:

import ir_datasets

dataset = ir_datasets.load("beir/msmarco/test")
for doc in dataset.docs_iter():
    print(doc)
    break

which prints:

GenericDoc(doc_id='0', text='The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.')

Additional context: In the terminal, cd ~/.ir_datasets/lotte && ls -R . returns:

.:
lotte_extracted

./lotte_extracted:
lotte

./lotte_extracted/lotte:
lifestyle  recreation

./lotte_extracted/lotte/lifestyle:
test

./lotte_extracted/lotte/lifestyle/test:
collection.tsv.pklz4

./lotte_extracted/lotte/lifestyle/test/collection.tsv.pklz4:
bin  bin.meta

./lotte_extracted/lotte/recreation:
test

./lotte_extracted/lotte/recreation/test:
collection.tsv.pklz4

./lotte_extracted/lotte/recreation/test/collection.tsv.pklz4:
bin  bin.meta
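
The listing above shows a collection.tsv.pklz4 docstore directory for each split but no collection.tsv next to it, which matches the path in the FileNotFoundError. The same check can be done programmatically with a minimal sketch like the one below. It uses only the standard library; the helper name find_missing_collections is hypothetical (not part of ir_datasets), and the default cache location ~/.ir_datasets is taken from the traceback and may differ on other setups.

```python
from pathlib import Path


def find_missing_collections(root):
    """List collection.tsv paths that are absent next to a docstore directory.

    Hypothetical diagnostic helper, not an ir_datasets API. It only inspects
    the file layout under `root` and does not touch the library's internals.
    """
    root = Path(root).expanduser()
    if not root.is_dir():
        # Nothing extracted (or a different cache location): nothing to report.
        return []
    missing = []
    # Each docstore directory is named collection.tsv.pklz4 in the listing above.
    for docstore in sorted(root.rglob("collection.tsv.pklz4")):
        tsv = docstore.with_name("collection.tsv")
        if not tsv.is_file():
            missing.append(tsv)
    return missing


if __name__ == "__main__":
    # Assumed cache location, copied from the error message in this report.
    for path in find_missing_collections("~/.ir_datasets/lotte/lotte_extracted"):
        print("missing:", path)
```

On the directory tree shown above, this would be expected to report the lifestyle/test and recreation/test collection.tsv files as missing.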

I'm working with:

Python implementation: CPython
Python version       : 3.11.0
IPython version      : 8.14.0

ir_datasets: 0.5.5

Compiler    : GCC 11.3.0
OS          : Linux
Release     : 5.15.0-84-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 16
Architecture: 64bit