bakape / thumbnailer

Go media thumbnailer

MIT License

154 stars 36 forks

Add support for comicbook archives #51

Closed gen2brain closed 5 years ago

gen2brain commented 5 years ago

Comic book archives are just zip/rar/7z/tar archives with images inside. Readers usually take the list of files in the archive, sort it in natural order and take the first image, which is usually the correct cover image.
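
The "sort naturally, take the first image" step can be sketched roughly as below. This is only an illustration, not code from any reader or from thumbnailer: `pickCover`, `naturalLess` and the `imageExts` set are hypothetical names, and the extension list is assumed from the formats mentioned in this thread.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// imageExts is an assumed set of extensions, taken from the formats named in
// this thread; it is not a list used by any particular reader.
var imageExts = map[string]bool{
	".jpg": true, ".jpeg": true, ".png": true,
	".gif": true, ".tiff": true, ".bmp": true,
}

func isDigit(c byte) bool { return c >= '0' && c <= '9' }

// naturalLess orders names case-insensitively and compares runs of digits by
// numeric value, so "page2.jpg" sorts before "page10.jpg".
func naturalLess(a, b string) bool {
	a, b = strings.ToLower(a), strings.ToLower(b)
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if isDigit(a[i]) && isDigit(b[j]) {
			// Find the end of both digit runs.
			si, sj := i, j
			for i < len(a) && isDigit(a[i]) {
				i++
			}
			for j < len(b) && isDigit(b[j]) {
				j++
			}
			// Without leading zeros, a longer run is a larger number.
			na := strings.TrimLeft(a[si:i], "0")
			nb := strings.TrimLeft(b[sj:j], "0")
			if len(na) != len(nb) {
				return len(na) < len(nb)
			}
			if na != nb {
				return na < nb
			}
			continue
		}
		if a[i] != b[j] {
			return a[i] < b[j]
		}
		i++
		j++
	}
	// The name that runs out first sorts first.
	return len(a)-i < len(b)-j
}

// pickCover (hypothetical helper) returns the first image name in natural
// order, or false if names contains no image.
func pickCover(names []string) (string, bool) {
	var images []string
	for _, n := range names {
		if imageExts[strings.ToLower(path.Ext(n))] {
			images = append(images, n)
		}
	}
	if len(images) == 0 {
		return "", false
	}
	sort.SliceStable(images, func(i, j int) bool {
		return naturalLess(images[i], images[j])
	})
	return images[0], true
}

func main() {
	names := []string{"page10.jpg", "page2.jpg", "notes.txt", "page1.png"}
	cover, ok := pickCover(names)
	fmt.Println(cover, ok) // prints: page1.png true
}
```

A plain lexicographic sort would put `page10.jpg` ahead of `page2.jpg`, which is why the digit runs are compared as numbers.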

This is the list of mimetypes:

application/x-cbz
application/vnd.comicbook+zip

application/x-cbr
application/vnd.comicbook-rar

application/x-cb7
application/x-cbt

I have a wrapper for unarr here https://github.com/gen2brain/go-unarr that can be used (it currently has a bug with LZMA2), or there is https://github.com/mholt/archiver in native Go (it doesn't have 7z support, but such comic books are rare anyway).

bakape commented 5 years ago

Is there any way to detect these formats, aside from unzipping and inspecting contents?

gen2brain commented 5 years ago

I think it is possible just by matching the file extension: cbr, cbz, cb7 and cbt. It is just an archive with images, and sometimes some .txt files are included as well.

gen2brain commented 5 years ago

And sometimes the cover is named cover.jpg or cover.bmp. Usually all image formats are supported, i.e. jpg, png, gif, tiff and bmp.

bakape commented 5 years ago

This library is meant to be usable with unknown client-uploaded files, so I want to avoid extension-based detection, since it is an attack vector.

gen2brain commented 5 years ago

Ok, I understand. You can probably then check if the file is an archive (i.e. by magic bytes), then check for cbr, cbz etc. and, if positive, list the archive contents, match extensions/mimetypes and, if all or most are images, sort the list of files and take the first image or the cover, and finally check that the file is a real image.
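
A minimal sketch of the magic-bytes check, for illustration only. `archiveKind` is a hypothetical name and this is not the matcher code from meguca or thumbnailer. The signatures are the commonly documented ones: `PK\x03\x04` for a ZIP local file header, `Rar!\x1a\x07\x00` (RAR 4.x) and `Rar!\x1a\x07\x01\x00` (RAR 5), and `7z\xbc\xaf\x27\x1c` for 7z.

```go
package main

import (
	"bytes"
	"fmt"
)

var (
	zipMagic    = []byte("PK\x03\x04")            // ZIP local file header
	rar4Magic   = []byte("Rar!\x1a\x07\x00")      // RAR 1.5 to 4.x
	rar5Magic   = []byte("Rar!\x1a\x07\x01\x00")  // RAR 5
	sevenZMagic = []byte("7z\xbc\xaf\x27\x1c")    // 7z
)

// archiveKind (hypothetical helper) inspects the first bytes of a file and
// returns "zip", "rar", "7z", or "" when no known signature matches.
// The file extension is never consulted.
func archiveKind(header []byte) string {
	switch {
	case bytes.HasPrefix(header, zipMagic):
		return "zip"
	case bytes.HasPrefix(header, rar4Magic), bytes.HasPrefix(header, rar5Magic):
		return "rar"
	case bytes.HasPrefix(header, sevenZMagic):
		return "7z"
	}
	return ""
}

func main() {
	fmt.Println(archiveKind([]byte("PK\x03\x04 rest of the file")))
	fmt.Println(archiveKind([]byte("not an archive")) == "")
}
```

Note that an empty ZIP archive starts with the end-of-central-directory signature (`PK\x05\x06`) instead, which this sketch deliberately does not match, since an archive without entries cannot be a comic book anyway.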

bakape commented 5 years ago

None of these files have a strict included metadata spec, correct? So I take it there is no harm in identifying any zip archive with just a bunch of images in it as CBZ (and the same for other archive types)?

gen2brain commented 5 years ago

Correct, nothing strict and there is no standard way. They usually just pack a bunch of images in an archive and rename that file based on the archive type.

bakape commented 5 years ago

Okay, so the implementation plan should be

- [x] Mime type detection for ZIP, RAR and 7zip archives (already have ZIP and 7z matcher functions in meguca - can move them from there)
- [x] If archive detected, the matcher should decompress and do an extra check for comic book formats. 90% of root level file types should be images.
  - [x] Needs to be piped to disk to conserve memory usage, if extraction is needed.
- [x] Extract first image and pass down the regular pipeline.
- [x] CBZ
  - [x] ZIP bomb mitigation
- [x] CBR
  - [x] ZIP bomb mitigation

I should be able to do this some time next week.
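
The "90% of root level files should be images" check from the plan above could look something like the following for ZIP, using only the standard `archive/zip` package. This is a sketch under assumptions, not the code that ended up in thumbnailer: `looksLikeComic`, `isImageName` and `zipNames` are hypothetical names, "root level" is taken to mean an entry name without a `/`, and images are recognised by extension only (the real image check would still happen later in the pipeline).

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"path"
	"strings"
)

// isImageName reports whether name has one of the assumed image extensions.
func isImageName(name string) bool {
	switch strings.ToLower(path.Ext(name)) {
	case ".jpg", ".jpeg", ".png", ".gif", ".tiff", ".bmp":
		return true
	}
	return false
}

// looksLikeComic reports whether at least 90% of the root-level files in
// names have an image extension. Directories and nested entries are ignored.
func looksLikeComic(names []string) bool {
	total, images := 0, 0
	for _, n := range names {
		if strings.Contains(n, "/") {
			continue // directory entry or file inside a subdirectory
		}
		total++
		if isImageName(n) {
			images++
		}
	}
	return total > 0 && images*10 >= total*9
}

// zipNames lists the entry names of a ZIP archive held in memory.
// Only the central directory is read; nothing is decompressed.
func zipNames(data []byte) ([]string, error) {
	r, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(r.File))
	for _, f := range r.File {
		names = append(names, f.Name)
	}
	return names, nil
}

func main() {
	// Build a small archive in memory: nine images and one text file.
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for i := 1; i <= 9; i++ {
		if _, err := w.Create(fmt.Sprintf("%02d.jpg", i)); err != nil {
			panic(err)
		}
	}
	if _, err := w.Create("info.txt"); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}

	names, err := zipNames(buf.Bytes())
	if err != nil {
		panic(err)
	}
	fmt.Println(looksLikeComic(names)) // prints: true
}
```

For ZIP the listing comes from the central directory, so the ratio can be computed before any entry is extracted.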

bakape commented 5 years ago

That's CBZ down. Doubt I'll be able to do the other 2 this week. Most likely next week.

the8472 commented 5 years ago

I'm not seeing any size limit for the unzipping. You might want to defend against zip bombs.

bakape commented 5 years ago

@the8472 Good point.

bakape commented 5 years ago

@the8472 Does checking the uncompressed size inside the file header mitigate the zip bomb, or do I need to limit the number of bytes read as well? What do you think is a sane limit?

the8472 commented 5 years ago

That would depend on whether the decompressor checks that the size in the metadata is not exceeded. As a general principle I wouldn't trust it unless the documentation says otherwise.

For the limit I would say one archive entry shouldn't decompress to more than the total upload limit.
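
Limiting the bytes actually read, regardless of what the archive metadata claims, might be sketched as below. `readLimited`, `errTooLarge` and the `zeros` test reader are hypothetical names for illustration; this is not thumbnailer's code, and the limit value is whatever upload limit the caller has. Any decompressing `io.Reader` (for example the one returned by `(*zip.File).Open`) could be passed in place of `zeros`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

var errTooLarge = errors.New("entry exceeds decompressed size limit")

// readLimited reads r to the end but gives up with errTooLarge as soon as
// more than max bytes have been produced. It does not trust any size
// declared in archive headers.
func readLimited(r io.Reader, max int64) ([]byte, error) {
	// Ask for one extra byte so an oversized stream can be told apart from
	// one that is exactly max bytes long.
	data, err := io.ReadAll(io.LimitReader(r, max+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > max {
		return nil, errTooLarge
	}
	return data, nil
}

// zeros yields `left` zero bytes and then io.EOF. With a huge count it
// stands in for a decompressor that has been fed a bomb.
type zeros struct{ left int64 }

func (z *zeros) Read(p []byte) (int, error) {
	if z.left <= 0 {
		return 0, io.EOF
	}
	n := len(p)
	if int64(n) > z.left {
		n = int(z.left)
	}
	for i := 0; i < n; i++ {
		p[i] = 0
	}
	z.left -= int64(n)
	return n, nil
}

func main() {
	const limit = 1 << 20 // assumed 1 MiB upload limit

	small, err := readLimited(&zeros{left: 100}, limit)
	fmt.Println(len(small), err)

	_, err = readLimited(&zeros{left: 1 << 40}, limit)
	fmt.Println(err)
}
```

Because the reader is capped at `max+1` bytes, the second call stops after about 1 MiB instead of trying to materialise the whole terabyte.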

bakape commented 5 years ago

CBZ and CBR support done. CB7 dropped for lack of a suitable Go library, and I'm not writing my own. @gen2brain Feel free to write a library that does not depend on precompiled binaries and is compatible with Go modules.
CBT is not even a compressed or standard format, so I won't bother.