cpitclaudel / alectryon

A collection of tools for writing technical documents that mix Coq code and prose.
MIT License
228 stars 36 forks source link

Large file #35

Closed thery closed 3 years ago

thery commented 3 years ago

Alectryon seems to generate big files. I realized that when github.io told me I could not transfer file > 100MB. This file is 228M while its equivalent in Proviola was only 18M Is it possible to do something?

cpitclaudel commented 3 years ago

Yes, there are a few things that we could do, some easier than others:

To get a best-case estimate, could you try gzip-ing the two files and seeing how large each of them are?

thery commented 3 years ago

You mean the proviola (18M/320K) while the alectryon (228M/7M)?

cpitclaudel commented 3 years ago

Yes, absolutely. Thanks! One more experiment: if you enable caching, how large is the resulting cache file, and how much does it compress? (you can enable caching with --cache-directory)

thery commented 3 years ago

Not sure how I do this with firefox. I did a clear cache before. It says it is 400K. after loading the file it says 1,4M

cpitclaudel commented 3 years ago

Oh, sorry: I meant enabling caching in Alectryon: Alectryon can speed up repetitive compilations by saving goals and responses as JSON files.

thery commented 3 years ago

Did the following command

~/git/alectryon/alectryon.py --frontend coqdoc shanoi4.v --cache-directory ./cache/

but the html file is about the same size and there is nothing in the cache directory

cpitclaudel commented 3 years ago

You're right, there ~is~ was no support for caching for no-reStructuredText documents. I've added that just now. Can you try again and let me know?

Here's what I see on a mid-sized file:

$ mkdir tmp; cd tmp
$ cp $(opam config var coq:lib)/theories/Lists/List.v ./List.v
$ …/alectryon.py --frontend coqdoc --output-directory . --cache-directory . List.v
$ gzip -9 --keep List.html List.v.cache
$ ls -lh List*
-rw-rw-r-- 1 clement clement 4.0M Jul  8 23:09 List.html
-rw-rw-r-- 1 clement clement 109K Jul  8 23:09 List.html.gz
-rw-r--r-- 1 clement clement  92K Jul  8 23:09 List.v
-rw-rw-r-- 1 clement clement 3.2M Jul  8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement  96K Jul  8 23:09 List.v.cache.gz

So on an example like List.v proper movie files ("caches") are not much smaller than HTML files (3.2MB vs 4.0MB, and the difference is smaller once compressed at 96K vs 109K).

Same conclusions, though with an even more pronounced reduction in size, for a file known to give Alectryon trouble:

$ cp $(opam config var coq:lib)/theories/Reals/Ranalysis3.v ./
$ …/alectryon.py --frontend coqdoc --output-directory . --cache-directory . Ranalysis3.v 
$ gzip -9 --keep Ranalysis3.html Ranalysis3.v.cache 
$ ls -lh Ranalysis3*
-rw-rw-r-- 1 clement clement  25M Jul  8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement 359K Jul  8 23:19 Ranalysis3.html.gz
-rw-r--r-- 1 clement clement  29K Jul  8 23:17 Ranalysis3.v
-rw-rw-r-- 1 clement clement  21M Jul  8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 278K Jul  8 23:18 Ranalysis3.v.cache.gz

Oh, and if you're curious, here's a comparison of brotli vs gzip:

-rw-rw-r-- 1 clement clement 4.0M Jul  8 23:09 List.html
-rw-rw-r-- 1 clement clement  74K Jul  8 23:09 List.html.br
-rw-rw-r-- 1 clement clement 109K Jul  8 23:09 List.html.gz
-rw-rw-r-- 1 clement clement 3.2M Jul  8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement  59K Jul  8 23:09 List.v.cache.br
-rw-rw-r-- 1 clement clement  96K Jul  8 23:09 List.v.cache.gz
-rw-rw-r-- 1 clement clement  25M Jul  8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement  48K Jul  8 23:19 Ranalysis3.html.br
-rw-rw-r-- 1 clement clement 359K Jul  8 23:19 Ranalysis3.html.gz
-rw-rw-r-- 1 clement clement  21M Jul  8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement  17K Jul  8 23:18 Ranalysis3.v.cache.br
-rw-rw-r-- 1 clement clement 278K Jul  8 23:18 Ranalysis3.v.cache.gz

So it doesn't look like caches as currently implemented help much versus plain HTML in terms of size. Also, it looks like if your server supports compression you're not in too bad a place (especially if it supports brotli, which beatz gzip by 7x in the Ranalysis3 case).

If you need smaller files, e.g. to keep them in a git repo, then caches don't immediately help, because they're not designed for size (they're pretty-printed, among other things). Let's see if we can make them smaller. Here's what happens when you just remove the blanks:

$ python3 -c 'import json, sys;json.dump(json.load(sys.stdin), sys.stdout)' < Ranalysis3.v.cache > Ranalysis3.v.cache.min 
$ python3 -c 'import json, sys;json.dump(json.load(sys.stdin), sys.stdout)' < List.v.cache > List.v.cache.min 
$ ls -lh *.cache* *.html
-rw-rw-r-- 1 clement clement 4.0M Jul  8 23:09 List.html
-rw-rw-r-- 1 clement clement 3.2M Jul  8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement 1.5M Jul  8 23:25 List.v.cache.min
-rw-rw-r-- 1 clement clement  25M Jul  8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement  21M Jul  8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 9.5M Jul  8 23:25 Ranalysis3.v.cache.min

That's a bit more than 50% off; minifying further, using a less verbose format, we can save about 35% less:

$ ls -lh *.cache* *.html
-rw-rw-r-- 1 clement clement 4.0M Jul  8 23:09 List.html
-rw-rw-r-- 1 clement clement 3.2M Jul  8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement 888K Jul  8 23:32 List.v.cache.min2
-rw-rw-r-- 1 clement clement  25M Jul  8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement  21M Jul  8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 6.3M Jul  8 23:32 Ranalysis3.v.cache.min2

and finally with a simple sed script to try to gauge further savings I get an extra 25%:

$ ls -lh *.cache* *.html 
-rw-rw-r-- 1 clement clement 4.0M Jul  8 23:09 List.html
-rw-rw-r-- 1 clement clement 3.2M Jul  8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement 522K Jul  8 23:43 List.v.cache.xs
-rw-rw-r-- 1 clement clement  25M Jul  8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement  21M Jul  8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 4.8M Jul  8 23:44 Ranalysis3.v.cache.xs

So, it would be reasonable, if uncompressed size is a concern, to invest in making smaller movie files. I think we could save about 80% versus uncompressed HTML. We would generate webpages that have only placeholders for the <pre> blocks, and we'd populate them dynamically from a downloaded json file. I could write a simple minifier, but I won't have time to write the webpage loader soon (it should be relatively easy, though.). With additional cleverness we might save space currently lost on repeated strings, like when we have the same subgoal repeated ten times, but I'm not sure if we want to spend the effort.

cpitclaudel commented 3 years ago

I've done many more experiments with this. Adding a simple form of deduplication (replacing repeated goals and hypotheses with backreferences) shrinks caches quite a bit:

3.2M List.v.cache
443K List.v.cache.dedup

 21M Ranalysis3.v.cache
352K Ranalysis3.v.cache.dedup    

I've implemented something similar for webpages directly, and it works pretty well, too:

4.4M List.html
1.8M List.min.html

 25M Ranalysis3.html
1.5M Ranalysis3.min.html

(It also makes pages load a lot faster, about 60% in the Ranalysis case).

Finally, I've added an option for cache compression. One thing that's really striking is that XZ and Brotli beat GZip by far on these examples:

 21M Jul  8 23:18 Ranalysis3.v.cache
278K Jul  8 23:18 Ranalysis3.v.cache.gz
 19K Jul  8 23:18 Ranalysis3.v.cache.br

Can you try the implementation that's in the wip branch? It should be enough to pass --cache-directory ., --html-minification and --cache-compression=xz.

thery commented 3 years ago

sorry I've been busy lately. I will try to test asap (firtunately by the end of the week)

cpitclaudel commented 3 years ago

OK, I've merged the changes; you can use --html-minification. Please try it and we'll refine the feature directly on master.

thery commented 3 years ago

very good! With --html-minification my files have really shrunk. The largest one is now only 3M. Here is the percent of the size of the old versions with respect to the size of the new ones.

triangular.html 133.00 %
phi.html    147.00 %
extra.html  220.00 %
psi.html    391.00 %
ghanoi3.html    443.00 %
gdist.html  458.00 %
shanoi.html 532.00 %
ghanoi4.html    565.00 %
ghanoi.html 638.00 %
star.html   728.00 %
rhanoi3.html    1616.00 %
lhanoi3.html    2627.00 %
shanoi4.html    7572.00 %
rhanoi4.html    8158.00 %

:tada:

cpitclaudel commented 3 years ago

Woohoo! :) Thanks a lot for testing.