danburzo / percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.
https://danburzo.ro/projects/percollate/
MIT License

EPUB files are completely uncompressed #169

Closed bmaupin closed 6 months ago

bmaupin commented 6 months ago

Environment

Description

I just generated an EPUB file and was surprised by how big it is. When I examined it, it appeared that all of the files in the EPUB were uncompressed.
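For anyone who wants to check this on their own files, here is a minimal sketch using Python's standard `zipfile` module. It is not part of percollate; the function name `describe_entries` is made up for illustration. It reports, for each entry in the archive, whether it is stored or deflated, along with its packed and original sizes.

```python
import zipfile


def describe_entries(epub_path):
    """Return (name, method, packed_size, original_size) for each archive entry.

    `describe_entries` is a hypothetical helper, not a percollate API.
    """
    method_names = {
        zipfile.ZIP_STORED: "stored",
        zipfile.ZIP_DEFLATED: "deflated",
    }
    with zipfile.ZipFile(epub_path) as archive:
        return [
            (
                info.filename,
                # Fall back to the numeric method code for anything unusual.
                method_names.get(info.compress_type, str(info.compress_type)),
                info.compress_size,
                info.file_size,
            )
            for info in archive.infolist()
        ]
```

Calling `describe_entries("before.epub")` on an uncompressed EPUB should report every entry as `stored`, with identical packed and original sizes.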

It's been a while since I've dug into the EPUB specification, but from what I remember the only restriction is that the mimetype file must come first. There are no restrictions regarding compression, so the EPUB files generated by this tool are unnecessarily large.

For now as a workaround I'm repackaging the EPUB files myself, e.g.

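# unpack the original EPUB
unzip before.epub -d tmp-epub
cd tmp-epub
# add mimetype first, stored without compression (-0)
zip -X0 ../after.epub mimetype
# add everything else with standard compression
zip -rX ../after.epub * -x mimetype
cd ..
rm -rf tmp-epub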

Here's a quick idea of the size difference for just one example file after compressing it with standard zip compression:

-rw-rw-r-- 1 user user 559K May  6 13:38 after.epub
-rw-rw-r-- 1 user user 2.4M May  6 10:15 before.epub

And lest I forget to mention it, thank you for this tool! This is exactly what I was looking for.

danburzo commented 6 months ago

Thanks for opening this issue. According to the EPUB 3.3 spec, the entries can be either stored (uncompressed) or Deflate-compressed. In addition, the mimetype file must not be compressed.
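To illustrate the mixed compression described above, here is a rough sketch in Python using only the standard `zipfile` module. It is not how percollate or archiver.js does it; `repackage_epub` is a hypothetical name. It assumes the source archive contains a `mimetype` entry, writes that entry first and stored, and deflates every other file.

```python
import zipfile


def repackage_epub(source_path, target_path):
    """Copy an EPUB so that `mimetype` is first and stored, the rest deflated.

    `repackage_epub` is a hypothetical helper written for illustration.
    """
    with zipfile.ZipFile(source_path) as source, \
            zipfile.ZipFile(target_path, "w") as target:
        # The mimetype entry goes first and must not be compressed.
        mimetype = zipfile.ZipInfo("mimetype")
        mimetype.compress_type = zipfile.ZIP_STORED
        target.writestr(mimetype, source.read("mimetype"))

        for info in source.infolist():
            # Skip the entry already written, and bare directory entries.
            if info.filename == "mimetype" or info.is_dir():
                continue
            target.writestr(
                info.filename,
                source.read(info.filename),
                compress_type=zipfile.ZIP_DEFLATED,
            )
```

For example, `repackage_epub("before.epub", "after.epub")` would produce a copy whose content files are Deflate-compressed. The sketch does not preserve timestamps or file permissions from the source archive.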

I don’t remember the circumstances exactly, but maybe I couldn’t figure out at the time how to achieve this variable compression with archiver.js? In any case, it’s worth revisiting.

danburzo commented 6 months ago

I don’t think there’s anything more to it than that, so I’ve made a tentative release, percollate@v4.2.0. I’ll investigate further whether these settings indeed offer the best compression, or whether something more can be done.