allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Cannot get documents using docstore on ClueWeb09? #134

Closed PxYu closed 2 years ago

PxYu commented 2 years ago

Describe the bug

Hi. I have trouble using the docstore.get() function. About half of the documents I am interested in raise the error below, while other document ids work just fine, which makes this very confusing...

Traceback (most recent call last):
  File "acquire_document_texts.py", line 37, in <module>
    tmp = docstore.get(docid)
  File "/home/username/miniconda3/lib/python3.8/site-packages/ir_datasets/indices/base.py", line 10, in get
    result = self.get_many([doc_id], field)
  File "/home/username/miniconda3/lib/python3.8/site-packages/ir_datasets/indices/base.py", line 18, in get_many
    for doc in self.get_many_iter(doc_ids):
  File "/home/username/miniconda3/lib/python3.8/site-packages/ir_datasets/indices/cache_docstore.py", line 22, in get_many_iter
    for doc in self.full_store.get_many_iter(doc_ids_remaining):
  File "/home/username/miniconda3/lib/python3.8/site-packages/ir_datasets/indices/clueweb_warc.py", line 150, in get_many_iter
    yield from index.get_many_iter(doc_ids, self.warc_docs)
  File "/home/username/miniconda3/lib/python3.8/site-packages/ir_datasets/indices/clueweb_warc.py", line 114, in get_many_iter
    f.read(out_offset)
  File "/home/username/miniconda3/lib/python3.8/site-packages/zlib_state/__init__.py", line 106, in read
    result += self.read1(count - len(result))
  File "/home/username/miniconda3/lib/python3.8/site-packages/zlib_state/__init__.py", line 119, in read1
    size = self.raw.readinto(self.buffer)
  File "/home/username/miniconda3/lib/python3.8/site-packages/zlib_state/__init__.py", line 38, in readinto
    count += self.decomp.read(outbytes=buf)
RuntimeError: invalid code lengths set

This is how I initialize the docstore:

dataset = ir_datasets.load("clueweb09/en")
docstore = dataset.docs_store()

Affected dataset(s)

ClueWeb09.

To Reproduce

import ir_datasets

dataset = ir_datasets.load("clueweb09/en")
docstore = dataset.docs_store()
docid = "clueweb09-en0000-00-05965"
tmp = docstore.get(docid)

This snippet results in the error above. I do not know what the issue is. Could my ClueWeb files be corrupted? I tried clearing the docstore cache, but that didn't help.

Any hint or help would be appreciated. Thanks!

PxYu commented 2 years ago

The contents of the ClueWeb dataset directory, in case it's helpful:

ls ~/.ir_datasets/clueweb09/corpus/

checksums            ClueWeb09_Chinese_3   ClueWeb09_English_2  ClueWeb09_English_6  ClueWeb09_French_1    ClueWeb09_Japanese_2    ClueWeb09_Spanish_2
ClueWeb09_Arabic_1   ClueWeb09_Chinese_4   ClueWeb09_English_3  ClueWeb09_English_7  ClueWeb09_German_1    ClueWeb09_Korean_1      record_counts
ClueWeb09_Chinese_1  ClueWeb09_English_1   ClueWeb09_English_4  ClueWeb09_English_8  ClueWeb09_Italian_1   ClueWeb09_Portuguese_1
ClueWeb09_Chinese_2  ClueWeb09_English_10  ClueWeb09_English_5  ClueWeb09_English_9  ClueWeb09_Japanese_1  ClueWeb09_Spanish_1

All folders other than checksums and record_counts are symbolic (soft) links.

seanmacavaney commented 2 years ago

Thanks for reporting. The above code works for me, but let's try to get to the bottom of what's going on here.

Can you run md5sum ~/.ir_datasets/clueweb09/corpus/ClueWeb09_English_1/en0000/00.warc.gz (the file containing the above document)? My version is 82cd52301030a50fdd6fa6e198bb6a07, which matches what's in the checksums file (head -n1 ~/.ir_datasets/clueweb09/corpus/checksums/ClueWeb09_English_1_checksums.md5).

PxYu commented 2 years ago

My version is e200a39790702507e7f3b91eb5d93515.

seanmacavaney commented 2 years ago

Gotcha, thanks. So this means that the zlib checkpoints will not work for fast lookups -- they depend on the files being compressed in exactly the same way.
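
To illustrate the general idea, here's a toy sketch using Python's standard zlib module (not the actual zlib_state code): the same content compressed differently yields different byte streams, and a checkpoint only makes sense against the exact stream it was taken from.

import zlib

# Identical content, different compression settings:
text = b'WARC/0.18 response record ' * 1000
a = zlib.compress(text, 9)
b = zlib.compress(text, 1)
assert zlib.decompress(a) == zlib.decompress(b)  # same content either way
assert a != b                                    # but different compressed bytes

# A checkpoint is roughly (offset into the compressed stream, decoder state
# at that offset). Replaying it against a differently-compressed file drops
# the decoder into bytes it has never seen, hence errors like
# "invalid code lengths set".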

This is a case we hadn't considered when designing this setup for working with ClueWeb. I think there are a couple of options:

  1. Detect a compression mismatch and disable the checkpointed lookups. This will mean looking up documents could be pretty slow, since it will potentially need to read & decompress an entire file to get a document that appears at the end.
  2. Detect a compression mismatch and manually/automatically build your own checkpoints. This process takes a long time, but would at least mean that you could benefit from fast lookups once it's done. There's an undocumented command that does this, but I'm not sure how well it would work in your setting because I've only tested it on the files that match the official versions of the source files.
  3. Do nothing. Require the source files with the original compression to be present. Potentially validate them by computing checksum(s) and throw a more informative error in this case.
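
For illustration, the validation in option 3 could look something like this sketch (it hard-codes the hash from above and assumes the checksums files follow standard md5sum output):

import hashlib
from pathlib import Path

def file_md5(path, chunk_size=1 << 20):
    # Hash a large file incrementally to avoid loading it all into memory.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

warc = Path.home() / '.ir_datasets/clueweb09/corpus/ClueWeb09_English_1/en0000/00.warc.gz'
official = '82cd52301030a50fdd6fa6e198bb6a07'  # from the checksums file
if file_md5(warc) != official:
    raise RuntimeError(f'{warc} does not match the official ClueWeb09 '
                       'distribution; checkpointed lookups will not work')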

What do you think? Is it possible that you have a version of ClueWeb that matches the official checksums somewhere?

cc @andrewyates

PxYu commented 2 years ago

Thanks for the explanation!

I don't know why our files do not match the official checksums; they have been on our server since 2009. I will contact our system administrator to see if I can get a copy that matches the official distribution.

In case that doesn't work, I might go with option 1. My use case is only document lookup (for around 60k documents). Is there something in the repo that could help this process?

seanmacavaney commented 2 years ago

There's nothing built into ir_datasets right now to do this. But here's a workaround for you: you can replace the checkpoint files with empty ones. Then it will always traverse the entire file (up to the requested documents) when performing lookups.

touch template
lz4c template
for f in ~/.ir_datasets/clueweb09/corpus.chk/*/*/*.chk.lz4 ; do cp template.lz4 $f ; done

Then you should be able to run the code from above no problem:

import ir_datasets

dataset = ir_datasets.load("clueweb09/en")
docstore = dataset.docs_store()
docid = "clueweb09-en0000-00-05965"
tmp = docstore.get(docid)

This seems to work fine for me -- let me know if it works for you!

Also, as a hint: use docstore.get_many or docstore.get_many_iter to avoid traversing the same file multiple times when it contains several of the documents you want.
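
Something like this (a sketch; get_many returns a dict keyed by doc_id, and get_many_iter yields docs in arbitrary order):

import ir_datasets

dataset = ir_datasets.load("clueweb09/en")
docstore = dataset.docs_store()

doc_ids = ["clueweb09-en0000-00-05965"]  # swap in your ~60k ids
docs = docstore.get_many(doc_ids)        # {doc_id: doc}, one pass per file
# or stream the results:
for doc in docstore.get_many_iter(doc_ids):
    print(doc.doc_id)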

I'm still debating whether this is functionality we should build into ir_datasets or not.

PxYu commented 2 years ago

Error message:

FileNotFoundError: [Errno 2] No such file or directory: '/home/username/.ir_datasets/clueweb09/corpus.chk/ClueWeb09_English_10/en0131/68.warc.gz.chk.lz4'

seanmacavaney commented 2 years ago

Huh, is there anything in ~/.ir_datasets/clueweb09/corpus.chk/ClueWeb09_English_10/en0131/? The for f in ~/.ir_datasets loop above should have replaced all the files it was expecting.

PxYu commented 2 years ago

Weird that 68.warc.gz.chk.lz4 wasn't in ~/.ir_datasets/clueweb09/corpus.chk/ClueWeb09_English_10/en0131/ originally. I just copied the empty template there and that problem went away. But now it says:

Traceback (most recent call last):
  File "acquire_document_texts.py", line 28, in <module>
    tmp = docstore.get(docid)
  File "/home/username/miniconda3/lib/python3.8/site-packages/ir_datasets/indices/base.py", line 13, in get
    raise KeyError(f'doc_id={doc_id} not found')
KeyError: 'doc_id=clueweb09-en0000-00-05965 not found'

Seems like it's not traversing the file when the checkpoint is empty.

seanmacavaney commented 2 years ago

Weird. It's hard for me to know if it's a problem with your version of the source files or the ir_datasets code somehow.

I wonder if the source file is substantially different in size (suggesting some documents are missing). Can you run:

ls -l ~/.ir_datasets/clueweb09/corpus/ClueWeb09_English_1/en0000/00.warc.gz

(mine is 168,967,171 bytes, ~162M)

Maybe also check whether the document appears in the source file directly, like this:

zcat ~/.ir_datasets/clueweb09/corpus/ClueWeb09_English_1/en0000/00.warc.gz | grep -a clueweb09-en0000-00-05965

Finally, you could try the patch I just made to the branch https://github.com/allenai/ir_datasets/tree/issue-134, in case it's a problem with the checkpoint template file workaround. You'll need to set the environment variable IR_DATASETS_CW_SKIP_CHK=1.
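
For example (assuming your lookup script is the acquire_document_texts.py from the traceback above, and installing the branch via pip):

pip install --upgrade git+https://github.com/allenai/ir_datasets.git@issue-134
export IR_DATASETS_CW_SKIP_CHK=1
python acquire_document_texts.py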

PxYu commented 2 years ago

The patch works for me.

My file is slightly smaller than your version, but the documents were all there. Not sure why though...

Anyway, thank you for being very patient with me! I have no further questions here, so feel free to close this issue.

seanmacavaney commented 2 years ago

No problem -- I'm glad it's working. Sorry that the template workaround was more of a hassle than it was worth.

I'll close the issue, but can you keep me posted if you find anything about the different compression? It would be helpful to know if there are two "official" versions of the corpus going around, in which case we could provide built-in support for both versions.