"invalid distance too far back" when seek/read a large file

dictzip / dictzip-java

DictZip, GZip random access compression format(.dz), access library for Java

https://codeberg.org/miurahr/dictzip-java


"invalid distance too far back" when seek/read a large file #24

Closed geniot closed 2 years ago

geniot commented 3 years ago

I have a 45 MB text file that I compress with DictZipOutputStream to an 11 MB file. I then try to seek and read it with DictZipInputStream. I get:

Caused by: java.util.zip.ZipException: invalid distance too far back
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at org.dict.zip.DictZipInputStream.read(DictZipInputStream.java:154)
    at org.dict.zip.DictZipInputStream.readFully(DictZipInputStream.java:194)
    at org.dict.zip.DictZipInputStream.readFully(DictZipInputStream.java:179)

Could you try to dictzip a large file, seek to the end, and read the last bytes? I think something is wrong with the header.

Strangely, though, if I read the whole file with seek(0) followed by read(bbs.length), I get all the data. So the data is in the file, but the seek-then-read sequence does not work correctly on a large file.

miurahr commented 3 years ago

Does #25 reproduce your case?

geniot commented 3 years ago

Yes, it does, on both Windows 10 and Ubuntu. Thanks for this test. It gives me the same exception. Right now I'm trying to find the bug. It seems that BUF_LEN = 58315 is a very important constant: it should be used for both deflation and inflation.
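To illustrate why the chunk length matters for seeking, here is a minimal sketch of how an uncompressed offset maps to a chunk. It assumes the data is split into fixed-size chunks of BUF_LEN uncompressed bytes and that the compressed size of each chunk is known; the class `ChunkLocator` and its methods are hypothetical and are not part of dictzip-java.

```java
final class ChunkLocator {
    // Chunk length quoted in this thread; assumed to be the uncompressed chunk size.
    static final int BUF_LEN = 58315;

    /** Index of the chunk that contains the given uncompressed offset. */
    static int chunkIndex(long offset) {
        return (int) (offset / BUF_LEN);
    }

    /** Position of the given uncompressed offset inside its chunk. */
    static int offsetInChunk(long offset) {
        return (int) (offset % BUF_LEN);
    }

    /**
     * Start of a chunk relative to the beginning of the compressed data,
     * computed by summing the compressed sizes of all preceding chunks.
     */
    static long compressedStart(int[] compressedSizes, int index) {
        long start = 0;
        for (int i = 0; i < index; i++) {
            start += compressedSizes[i];
        }
        return start;
    }

    public static void main(String[] args) {
        long offset = 45L * 1024 * 1024 - 1; // last byte of a 45 MiB file
        System.out.println("chunk " + chunkIndex(offset)
                + ", offset in chunk " + offsetInChunk(offset));
    }
}
```

If the writer and the reader disagree on the chunk length, this mapping points the inflater at the wrong position, so the same value has to be used on both sides.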

geniot commented 3 years ago

Or did you mean that this test passes on your computer? In that case, which JDK are you using? I realize that this exception may be connected to https://bugs.openjdk.java.net/browse/JDK-8200671, but I tried different versions and providers, and even different operating systems (Linux and Windows).

miurahr commented 2 years ago

After investigating the test case, I conclude that compression with dictzip-java produces a broken archive, because the Linux dictzip command fails to read the generated archive file.

Please consider using the Linux dictzip command as an alternative for compressing data. https://linux.die.net/man/1/dictzip https://packages.ubuntu.com/impish/dictzip https://sourceforge.net/projects/dict/

geniot commented 2 years ago

Thank you. I decided to write my own implementation: https://github.com/geniot/elex/blob/master/src/main/java/io/github/geniot/elex/ezip/model/ElexDictionary.java It was inspired by dictzip, I think. The dictionaries at https://elex.mobi/index.html are zipped using ElexDictionary. The header is located at the end of a RandomAccessFile and contains chunk offsets and headword "starters".

miurahr commented 2 years ago

I think #38 fixes the bug. There are two root causes:

1. dictzip should use FULL_FLUSH as the deflater flush mode, but it was using SYNC_FLUSH.
2. dictzip should flush exactly at chunk-size boundaries, but it was flushing on every write call.
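The two points above can be sketched with java.util.zip alone. This is a minimal illustration, not the code from #38: the class `ChunkedDeflateSketch` and its methods are hypothetical, the chunk size is arbitrary, and it produces bare deflate chunks without the dictzip header. The point is that FULL_FLUSH resets the compression history at each boundary, so a fresh Inflater can start at any chunk; with SYNC_FLUSH later chunks may still refer back to earlier data, which is the kind of stream where starting mid-file can fail with "invalid distance too far back".

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class ChunkedDeflateSketch {

    /** Deflates data in fixed-size chunks, flushing once at every chunk boundary. */
    static List<byte[]> compressChunks(byte[] data, int chunkSize, int flushMode) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate
        List<byte[]> chunks = new ArrayList<>();
        byte[] buf = new byte[chunkSize + 1024];
        try {
            for (int off = 0; off < data.length; off += chunkSize) {
                int len = Math.min(chunkSize, data.length - off);
                deflater.setInput(data, off, len);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                do {
                    // A flush is complete once deflate() leaves room in the buffer.
                    n = deflater.deflate(buf, 0, buf.length, flushMode);
                    out.write(buf, 0, n);
                } while (n == buf.length);
                chunks.add(out.toByteArray());
            }
        } finally {
            deflater.end();
        }
        return chunks;
    }

    /** Inflates a single chunk with a fresh Inflater, i.e. without any earlier history. */
    static byte[] inflateChunk(byte[] chunk, int expectedLen) {
        Inflater inflater = new Inflater(true); // raw deflate
        try {
            inflater.setInput(chunk);
            byte[] out = new byte[expectedLen];
            int total = 0;
            while (total < expectedLen) {
                int n = inflater.inflate(out, total, expectedLen - total);
                if (n == 0) {
                    break; // input exhausted or stream finished
                }
                total += n;
            }
            return Arrays.copyOf(out, total);
        } catch (DataFormatException e) {
            throw new IllegalStateException("chunk is not independently decodable", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] data = "dictzip random access ".repeat(1000).getBytes(StandardCharsets.US_ASCII);
        int chunkSize = 4096;
        List<byte[]> chunks = compressChunks(data, chunkSize, Deflater.FULL_FLUSH);
        // Decode only the second chunk, as a seek into the middle of the file would.
        byte[] second = inflateChunk(chunks.get(1), chunkSize);
        boolean same = Arrays.equals(second, Arrays.copyOfRange(data, chunkSize, 2 * chunkSize));
        System.out.println("second chunk decoded independently: " + same);
    }
}
```

Passing Deflater.SYNC_FLUSH instead of Deflater.FULL_FLUSH to `compressChunks` is a way to experiment with the failing case, since the later chunks are then no longer guaranteed to decode on their own.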

miurahr commented 2 years ago

v0.10.3, which includes the fix, is out.