buggins / coolreader

Official site of CoolReader project. Sourceforge repository is obsolete.
GNU General Public License v2.0
387 stars 101 forks source link

fix deflation of epub files with abnormal zip header #315

Open fred777 opened 3 years ago

fred777 commented 3 years ago

This fixes zip deflation bug triggered by some watermarked, DRM-free epub file which I bought from a book store. It seems to have been created with Adobe Indesign 9.2.

Cool Reader standalone (as well as KOReader on my Tolino based on crengine, see related issue https://github.com/koreader/koreader/issues/8222) was unable to display it while other readers like the one from Calibre does not have issues. As soon as I extract all contents and re-compress them into a new epub file, Cool Reader did not have issues anymore, so the culprit lies in the binary zip structure of the epub file.

It turned out that packSize was detected as zero within LVZipDecodeStream::Create while unpSize was detected correctly, so the existing workaround did not kick in => I have separated it into two distinct if clauses.

Unfortunately I can't give you the epub due to the watermarking and copyright :-/

virxkane commented 3 years ago

Please check InfoZip AppNote 6.3.9:

   4.4.8 compressed size: (4 bytes)
   4.4.9 uncompressed size: (4 bytes)

       The size of the file compressed (4.4.8) and uncompressed,
       (4.4.9) respectively.  When a decryption header is present it 
       will be placed in front of the file data and the value of the
       compressed file size will include the bytes of the decryption
       header.  If bit 3 of the general purpose bit flag is set, 
       these fields are set to zero in the local header and the 
       correct values are put in the data descriptor and
       in the central directory.
       .....................
        Bit 3: If this bit is set, the fields crc-32, compressed 
               size and uncompressed size are set to zero in the 
               local header.  The correct values are put in the 
               data descriptor immediately following the compressed
               data.  (Note: PKZIP version 2.04g for DOS only 
               recognizes this bit for method 8 compression, newer 
               versions of PKZIP recognize this bit for any 
               compression method.)

Can you check this flag with this files? https://github.com/buggins/coolreader/blob/682fb3e34e48bd16f60107708bb8686713fd65e2/crengine/src/lvstream/ziphdr.h#L24 Perhaps you need to add yet another workaround for this case.

virxkane commented 3 years ago

Also, please format the code to match the formatting of this file (lvzipdecodestream.cpp). What is this block for? The condition was removed, but the block remained. I would like the files to be formatted in the same style.

https://github.com/fred777/coolreader/blob/fdcf0a6f638e1d45818d64b554dd30fc5278935b/crengine/src/lvstream/lvzipdecodestream.cpp#L383-L388

Thanks for the fix. It is inconvenient that you cannot provide the problematic file. This means you need to check that there are no regressions :).

Will there be no problems with catalogs/directories with these conditions?

fred777 commented 3 years ago

Regarding the formatting - will fix that as you prefer - I just kept the block for cosmetic reasons, i.e. to indicate where the comment is related to...

Regressions - not sure how to deal with it - if you have a test suite somewhere or can give some files that you suspect to be problematic I can check them.

fred777 commented 3 years ago

The value of hdr.Flags is 8 for META-INF/content.xml => only bit 4 is set. The full contents of the hdr struct is:

        hdr @0x7ffe6d2396c0 ZipLocalFileHdr
            Flags   8   lUInt16
            Mark    67324752    lUInt32
            UnpOS   0   lUInt8
            UnpVer  20  lUInt8
            others  @0x7ffe6d2396c8 lUInt16[11]
                [0] 8   lUInt16
                [1] 20672   lUInt16
                [2] 17665   lUInt16
                [3] 0   lUInt16
                [4] 0   lUInt16
                [5] 0   lUInt16
                [6] 0   lUInt16
                [7] 266 lUInt16
                [8] 0   lUInt16
                [9] 22  lUInt16
                [10]    0   lUInt16

...and this is the related zipinfo output:

Central directory entry #122:
---------------------------

  There are an extra 16 bytes preceding this file.

  META-INF/container.xml

  offset of local header from start of archive:   3060847
                                                  (00000000002EB46Fh) bytes
  file system or operating system of origin:      MS-DOS, OS/2 or NT FAT
  version of encoding software:                   2.0
  minimum file system compatibility required:     MS-DOS, OS/2 or NT FAT
  minimum software version required to extract:   2.0
  compression method:                             deflated
  compression sub-type (deflation):               normal
  file security status:                           not encrypted
  extended local header:                          yes
  file last modified on (DOS date/time):          2014 Aug 1 10:06:00
  32-bit CRC value (hex):                         053eef39
  compressed size:                                188 bytes
  uncompressed size:                              266 bytes
  length of filename:                             22 characters
  length of extra field:                          0 bytes
  length of file comment:                         0 characters
  disk number on which file begins:               disk 1
  apparent file type:                             binary
  non-MSDOS external file attributes:             000000 hex
  MS-DOS file attributes (00 hex):                none

  There is no file comment.

Not sure if this helps...

virxkane commented 3 years ago

Regressions - not sure how to deal with it - if you have a test suite somewhere or can give some files that you suspect to be problematic I can check them.

Unfortunately, we do not have any test suite, so you can just check the functionality on other zip files. Although your code looks safe, I think it shouldn't be a problem. How often have I thought so, and how often have I been wrong :) But OK, I don't mind. But we'll have to wait for @buggins, I don't have an agreement with him about accepting other people's PR.

virxkane commented 3 years ago

The value of hdr.Flags is 8 for META-INF/content.xml => only bit 4 is set. The full contents of the hdr struct is:

You can also uncomment this line: https://github.com/buggins/coolreader/blob/682fb3e34e48bd16f60107708bb8686713fd65e2/crengine/src/lvstream/lvziparc.cpp#L364 and see in console some debug messages including contents of the extra fields. Description of this extra field you can find in InfoZip AppNote.

poire-z commented 3 years ago

Looks quite innofensive to me. I haven't looked a the code around, but the only situation that could feel it needs a test is with a file with empty content inside the zip (where I expect unpSize to be 0 while packSize might not be).

fred777 commented 3 years ago

There you go:

2021/09/19 19:19:16.2621 TRACE ZIP entry 'META-INF/container.xml' unpSz=266, pSz=188, m=8, offs=3060847, zAttr=0, flg=0, addL=0, commL=0, dn=0

For your convenience I also tried an empty epub file (aka empty zip file) => no crashes / errors, I guess because it will never try to read META-INF/container.xml in this cause ;-) The first thing it tries to extract from epub zip is mimedata in zip root - which is not present in this case here, so the code will exit early.

poire-z commented 3 years ago

TRACE ZIP entry 'META-INF/container.xml' unpSz=266, pSz=188

That was with an empty META-INF/container.xml ? So, unpSz is not the size of the file content, but the size of the zip item/entry, which must includes the date and other attributes, so always non-zero, right ?

virxkane commented 3 years ago

So, unpSz is not the size of the file content, but the size of the zip item/entry, which must includes the date and other attributes, so always non-zero, right ?

Isn't this information present in localzipheader and extra fields? This is exactly the size of the file. See appnote infozip, section 4.4.9.

fred777 commented 3 years ago

No, that was with the original epub file. With empty container.xml the result is

2021/09/19 19:41:33.0281 TRACE ZIP entry 'META-INF/container.xml' unpSz=0, pSz=0, m=0, offs=780464, zAttr=0, flg=81800000, addL=24, commL=0, dn=0
2021/09/19 19:41:33.0281 TRACE   ZIP entry extra data: :55:54:05:00:03:77:75:47:61:75:78:0B:00:01:04:E8:03:00:00:04:64:00:00:00

=> notice that the original epub file does not have extra header data - opposed to the modified file I just created with empty container.xml

poire-z commented 3 years ago

OK, I don't want to put my head in the specs, so trusting you that your small fix is sane with various combinations of packSize / unpSize when only one is zero, and that it is sane to use only one of srcPackSize / srcUnpSize :)