allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0
316 stars 42 forks

lz4f_decompress failed with code: ERROR_frametype_unknown #222

Closed yogeswarl closed 1 year ago

yogeswarl commented 1 year ago

Describe the bug

Hi, I am currently using aolia-tools with ir_datasets. I spent two days downloading the documents from the Internet Archive. I noticed that aolia-tools could not fetch all the docs successfully (approximately 25 links kept failing). I found a workaround that said to add the _done file so that ir_datasets would start fetching documents. After doing so, I ran into a new and very strange error for which I could not find a solution. Could anyone please help me with it?

Affected dataset(s)

aol-ia

To Reproduce

Steps to reproduce the behaviour:

  1. Download the docs using aolia-tools.
  2. Use docs_iter() to iterate over the documents.
  3. The error "LZ4F_decompress failed with code: ERROR_frametype_unknown" is thrown.
  4. An image showing the error: [image]

Expected behaviour

It should run without any problem.

I would be happy if you could provide me with a workaround.
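To narrow down where the failure in the steps above happens, it may help to iterate defensively and record how far iteration gets before the exception is raised. The following is only a minimal sketch: `iterate_until_error` is a hypothetical helper name, not part of ir_datasets, and it works on any iterator.

```python
def iterate_until_error(docs):
    """Consume an iterator and report how many items were read before a failure.

    Returns a (count, error) tuple; error is None if iteration finished cleanly.
    """
    count = 0
    try:
        for _ in docs:
            count += 1
    except Exception as exc:  # broad on purpose: this is a diagnostic helper
        return count, exc
    return count, None


# Usage with the dataset from this issue (requires ir_datasets and the
# downloaded documents, so it is left commented out here):
#
#   import ir_datasets
#   dataset = ir_datasets.load("aol-ia")
#   count, error = iterate_until_error(dataset.docs_iter())
#   print(count, repr(error))
```

A count of zero together with an error would suggest that the failure happens before the first document is produced rather than partway through iteration.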

seanmacavaney commented 1 year ago

Interesting; thanks for reporting.

Can you try removing the docs.pklz4 directory and trying again? Perhaps there was an issue when building it.

rm -r ~/.ir_datasets/aol-ia/docs.pklz4/

If that doesn't work, perhaps one of the source files in ~/.ir_datasets/aol-ia/downloaded_docs/ may be corrupted. I'll think a bit more about possible workarounds for that.
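One quick way to look for a corrupted source file is to check the first four bytes of each file. In the LZ4 frame format, a frame starts with the magic number 0x184D2204 (little-endian), or with a skippable-frame magic number in the range 0x184D2A50 to 0x184D2A5F. The sketch below uses only the standard library and assumes that every file in the directory is LZ4-compressed; any other file (for example a marker file such as _done) would also be reported. The function names are illustrative, not part of ir_datasets or aolia-tools.

```python
import struct
from pathlib import Path

LZ4_FRAME_MAGIC = 0x184D2204      # standard LZ4 frame
SKIPPABLE_MAGIC_MIN = 0x184D2A50  # skippable frames use 0x184D2A50..0x184D2A5F
SKIPPABLE_MAGIC_MAX = 0x184D2A5F


def has_known_lz4_frame_type(path):
    """Return True if the file starts with a recognised LZ4 frame magic number."""
    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) < 4:
        return False  # empty or truncated file
    (magic,) = struct.unpack("<I", head)  # magic number is stored little-endian
    return magic == LZ4_FRAME_MAGIC or SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX


def find_suspect_files(directory):
    """List files under `directory` whose header is not a recognised LZ4 frame."""
    root = Path(directory).expanduser()
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and not has_known_lz4_frame_type(p)
    )


# Example (path taken from the comment above):
#   for p in find_suspect_files("~/.ir_datasets/aol-ia/downloaded_docs/"):
#       print(p)
```

This only inspects the header, so a file that passes can still be damaged further in; a full decompression pass is the more thorough test.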

yogeswarl commented 1 year ago

Thank you for your quick response. I found that one of my files was corrupted. I took the time to run the LZ4 decompression CLI on the downloaded files, and decompression failed on one file that contained a corrupted line.
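The same check can be scripted instead of running the CLI file by file. The sketch below tries to fully decompress every file under a directory and collects the ones that fail. It assumes the third-party `lz4` Python package is installed and that all files in the directory are LZ4 frames; `find_unreadable_files` and its `opener` parameter are hypothetical names introduced here for illustration.

```python
from pathlib import Path


def find_unreadable_files(directory, opener=None, chunk_size=1 << 20):
    """Try to read every file to the end and return {path: error message} for failures.

    `opener` defaults to lz4.frame.open; any callable with the same
    (filename, mode) interface, such as gzip.open, can be passed instead.
    """
    if opener is None:
        import lz4.frame  # third-party package, imported lazily
        opener = lz4.frame.open

    failures = {}
    for path in sorted(Path(directory).expanduser().rglob("*")):
        if not path.is_file():
            continue
        try:
            with opener(str(path), "rb") as f:
                # Read in chunks so that large files are not held in memory.
                while f.read(chunk_size):
                    pass
        except Exception as exc:  # broad on purpose: report any decoding failure
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures


# Example:
#   for path, message in find_unreadable_files(
#           "~/.ir_datasets/aol-ia/downloaded_docs/").items():
#       print(path, message)
```

Any file reported this way is a candidate for re-downloading before rebuilding docs.pklz4.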