Closed gen2brain closed 5 years ago
Is there any way to detect these formats, aside from unzipping and inspecting contents?
I think it is possible just by matching the file extension: cbr, cbz, cb7 and cbt. It is just an archive with images, and sometimes some .txt files are included as well. Sometimes the cover is named cover.jpg or cover.bmp. Usually all image formats are supported, i.e. jpg, png, gif, tiff and bmp.
This library is meant to be usable with unknown client-uploaded files. Thus I want to avoid detection based on extension as an attack vector.
Ok, I understand. Then you can probably check whether the file is an archive (i.e. by magic bytes), then check for cbr, cbz, etc., and if that is positive, list the archive contents and match extensions/mimetypes. If all or most of the entries are images, sort the list of files, take the first image or the cover, and finally check that this file is a real image.
None of these files have a strict included metadata spec, correct? So I take it there is no harm in identifying any zip archive with just a bunch of images in it as CBZ (and the same for other archive types)?
Correct, nothing strict, and there is no standard way. They usually just pack a bunch of images into an archive and rename that file based on the archive type.
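For reference, the "any zip that is mostly images" check discussed above could be sketched roughly like this. It is only a minimal sketch using the standard library, not the code that ended up in this library. The names looksLikeCBZ, isImageEntry and buildZip are made up for illustration, and the "more than half" threshold is an assumption. Entries are sniffed with http.DetectContentType instead of trusting file names; as far as I know that function does not recognise TIFF, so TIFF-only archives would need an extra check.

```go
package main

import (
	"archive/zip"
	"bytes"
	"io"
	"net/http"
	"strings"
)

// zipMagic is the local file header signature at the start of a non-empty ZIP file.
var zipMagic = []byte("PK\x03\x04")

// looksLikeCBZ (hypothetical name) reports whether data is a ZIP archive in
// which more than half of the regular files sniff as images.
// The "more than half" threshold is an assumption, not a rule from any spec.
func looksLikeCBZ(data []byte) bool {
	if !bytes.HasPrefix(data, zipMagic) {
		return false
	}
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return false
	}
	var files, images int
	for _, f := range zr.File {
		if f.FileInfo().IsDir() {
			continue
		}
		files++
		if isImageEntry(f) {
			images++
		}
	}
	return files > 0 && images*2 > files
}

// isImageEntry sniffs at most the first 512 bytes of an entry, so the
// entry name and extension are never trusted.
func isImageEntry(f *zip.File) bool {
	rc, err := f.Open()
	if err != nil {
		return false
	}
	defer rc.Close()
	head := make([]byte, 512)
	n, err := io.ReadFull(rc, head)
	if err != nil && err != io.ErrUnexpectedEOF {
		return false // empty entry or read error
	}
	return strings.HasPrefix(http.DetectContentType(head[:n]), "image/")
}

// buildZip is a demo helper that packs the given entries into an in-memory ZIP.
func buildZip(entries map[string][]byte) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for name, content := range entries {
		w, err := zw.Create(name)
		if err != nil {
			panic(err)
		}
		if _, err := w.Write(content); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}
```

Only a small prefix of each entry is read here, so this check on its own does not decompress whole files.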
Okay, so the implementation plan should be
I should be able to do this some time next week.
That's CBZ down. Doubt I'll be able to do the other 2 this week. Most likely on the next.
I'm not seeing any size limit for the unzipping. You might want to defend against zip bombs.
@the8472 Good point.
@the8472 Does checking the uncompressed size inside the file header mitigate the zip bomb, or do I need to limit the number of bytes read as well? What do you think is a sane limit?
That would depend on whether the decompressor checks that the size in the metadata is not exceeded. As a general principle I wouldn't trust it unless the documentation says otherwise.
For the limit I would say one archive entry shouldn't decompress to more than the total upload limit.
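To illustrate the "don't trust the header, cap the bytes read" idea, here is a rough sketch with Go's archive/zip. It is not the library's actual code. The names readEntryLimited, readFirstEntry, zipOf and errTooLarge are hypothetical, and the caller is assumed to pass the total upload limit as limit. The declared size is used only as a cheap early exit; the io.LimitReader is what actually bounds the decompressed output. io.ReadAll needs Go 1.16 or newer.

```go
package main

import (
	"archive/zip"
	"bytes"
	"errors"
	"io"
)

// errTooLarge (hypothetical) signals that an entry would exceed the limit.
var errTooLarge = errors.New("archive entry exceeds size limit")

// readEntryLimited decompresses one entry but never produces more than
// limit bytes, regardless of what the archive metadata claims.
func readEntryLimited(f *zip.File, limit int64) ([]byte, error) {
	// Cheap early exit. The declared size is attacker-controlled,
	// so this check alone is not a sufficient defence.
	if f.UncompressedSize64 > uint64(limit) {
		return nil, errTooLarge
	}
	rc, err := f.Open()
	if err != nil {
		return nil, err
	}
	defer rc.Close()
	// Read at most limit+1 bytes so "exactly limit" and "over limit"
	// can be told apart.
	data, err := io.ReadAll(io.LimitReader(rc, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, errTooLarge
	}
	return data, nil
}

// readFirstEntry is a usage helper: it opens an in-memory archive and
// reads its first entry under the given limit.
func readFirstEntry(archive []byte, limit int64) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return nil, err
	}
	if len(zr.File) == 0 {
		return nil, errors.New("empty archive")
	}
	return readEntryLimited(zr.File[0], limit)
}

// zipOf is a demo helper that builds a ZIP with a single entry.
func zipOf(content []byte) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("page.bin")
	if err != nil {
		panic(err)
	}
	if _, err := w.Write(content); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}
```

A megabyte of zeros compresses to very little, so something like readFirstEntry(zipOf(make([]byte, 1<<20)), 1024) should return errTooLarge, while the same call with a limit of 1<<20 should succeed.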
CBZ and CBR support done. CB7 dropped for lack of a suitable Go library and I'm not writing my own.
@gen2brain Feel free to write a library that does not depend on precompiled binaries and is compatible with Go modules.
CBT is not even a compressed or standard format, so I won't bother.
Comicbook archives are just zip/rar/7z/tar archives with images inside. Readers usually just take the list of files in the archive, sort it in natural order and take the first image; that one is usually the correct cover image.
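The "natural order, then first image" step could look roughly like this in Go. It is a sketch only: naturalLess, pickCover and the imageExts set are names invented here, comparison is byte-wise and case-insensitive, and the extension filter is only used to pick a candidate. The chosen file should still be checked to be a real image afterwards.

```go
package main

import (
	"path"
	"sort"
	"strings"
)

// imageExts is an assumed set of extensions used only to pick a cover candidate.
var imageExts = map[string]bool{
	".jpg": true, ".jpeg": true, ".png": true, ".gif": true,
	".tif": true, ".tiff": true, ".bmp": true,
}

func isDigit(c byte) bool { return c >= '0' && c <= '9' }

// naturalLess orders strings so that runs of digits compare by numeric
// value, e.g. "page2.jpg" sorts before "page10.jpg".
func naturalLess(a, b string) bool {
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if isDigit(a[i]) && isDigit(b[j]) {
			si, sj := i, j
			for i < len(a) && isDigit(a[i]) {
				i++
			}
			for j < len(b) && isDigit(b[j]) {
				j++
			}
			// Without leading zeros, a longer digit run is a larger number.
			na := strings.TrimLeft(a[si:i], "0")
			nb := strings.TrimLeft(b[sj:j], "0")
			if len(na) != len(nb) {
				return len(na) < len(nb)
			}
			if na != nb {
				return na < nb
			}
			continue
		}
		if a[i] != b[j] {
			return a[i] < b[j]
		}
		i++
		j++
	}
	// The string that runs out first sorts first.
	return len(a)-i < len(b)-j
}

// pickCover returns the first image name in natural order, or false if
// the list contains no image names.
func pickCover(names []string) (string, bool) {
	var images []string
	for _, n := range names {
		if imageExts[strings.ToLower(path.Ext(n))] {
			images = append(images, n)
		}
	}
	if len(images) == 0 {
		return "", false
	}
	sort.SliceStable(images, func(x, y int) bool {
		return naturalLess(strings.ToLower(images[x]), strings.ToLower(images[y]))
	})
	return images[0], true
}
```

A plain lexical sort would put page10.jpg before page2.jpg, which is why the digit runs are compared as numbers.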
This is the list of mimetypes:
I have a wrapper for unarr here, https://github.com/gen2brain/go-unarr, that can be used (it currently has a bug with LZMA2), or there is https://github.com/mholt/archiver in native Go (it doesn't have 7z support, but such comic books are rare anyway).