Alectryon seems to generate big files. I realized that when github.io told me I could not transfer files > 100MB. This file is 228M while its equivalent in Proviola was only 18M. Is it possible to do something?
Yes, there are a few things that we could do, some easier than others:
To get a best-case estimate, could you try gzipping the two files and seeing how large each of them is?
You mean the Proviola file (18M, 320K gzipped) versus the Alectryon one (228M, 7M gzipped)?
Yes, absolutely. Thanks! One more experiment: if you enable caching, how large is the resulting cache file, and how much does it compress? (You can enable caching with --cache-directory.)
Not sure how I do this with Firefox. I cleared the cache first; it said 400K, and after loading the file it says 1.4M.
Oh, sorry: I meant enabling caching in Alectryon. Alectryon can speed up repetitive compilations by saving goals and responses as JSON files.
I ran the following command:
~/git/alectryon/alectryon.py --frontend coqdoc shanoi4.v --cache-directory ./cache/
but the HTML file is about the same size, and there is nothing in the cache directory.
You're right, there ~~is~~ was no support for caching for non-reStructuredText documents. I've added it just now. Can you try again and let me know?
Here's what I see on a mid-sized file:
$ mkdir tmp; cd tmp
$ cp $(opam config var coq:lib)/theories/Lists/List.v ./List.v
$ …/alectryon.py --frontend coqdoc --output-directory . --cache-directory . List.v
$ gzip -9 --keep List.html List.v.cache
$ ls -lh List*
-rw-rw-r-- 1 clement clement 4.0M Jul 8 23:09 List.html
-rw-rw-r-- 1 clement clement 109K Jul 8 23:09 List.html.gz
-rw-r--r-- 1 clement clement 92K Jul 8 23:09 List.v
-rw-rw-r-- 1 clement clement 3.2M Jul 8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement 96K Jul 8 23:09 List.v.cache.gz
So, on an example like List.v, movie files ("caches") are not much smaller than HTML files (3.2MB vs 4.0MB, and the gap narrows further once compressed: 96K vs 109K).
Same conclusions, though with an even more pronounced reduction in size, for a file known to give Alectryon trouble:
$ cp $(opam config var coq:lib)/theories/Reals/Ranalysis3.v ./
$ …/alectryon.py --frontend coqdoc --output-directory . --cache-directory . Ranalysis3.v
$ gzip -9 --keep Ranalysis3.html Ranalysis3.v.cache
$ ls -lh Ranalysis3*
-rw-rw-r-- 1 clement clement 25M Jul 8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement 359K Jul 8 23:19 Ranalysis3.html.gz
-rw-r--r-- 1 clement clement 29K Jul 8 23:17 Ranalysis3.v
-rw-rw-r-- 1 clement clement 21M Jul 8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 278K Jul 8 23:18 Ranalysis3.v.cache.gz
Oh, and if you're curious, here's a comparison of brotli vs gzip:
-rw-rw-r-- 1 clement clement 4.0M Jul 8 23:09 List.html
-rw-rw-r-- 1 clement clement 74K Jul 8 23:09 List.html.br
-rw-rw-r-- 1 clement clement 109K Jul 8 23:09 List.html.gz
-rw-rw-r-- 1 clement clement 3.2M Jul 8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement 59K Jul 8 23:09 List.v.cache.br
-rw-rw-r-- 1 clement clement 96K Jul 8 23:09 List.v.cache.gz
-rw-rw-r-- 1 clement clement 25M Jul 8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement 48K Jul 8 23:19 Ranalysis3.html.br
-rw-rw-r-- 1 clement clement 359K Jul 8 23:19 Ranalysis3.html.gz
-rw-rw-r-- 1 clement clement 21M Jul 8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 17K Jul 8 23:18 Ranalysis3.v.cache.br
-rw-rw-r-- 1 clement clement 278K Jul 8 23:18 Ranalysis3.v.cache.gz
So it doesn't look like caches as currently implemented help much versus plain HTML in terms of size. It also looks like you're not in too bad a place if your server supports compression (especially if it supports brotli, which beats gzip by about 7x in the Ranalysis3 case).
If you need smaller files, e.g. to keep them in a git repo, then caches don't immediately help, because they're not designed for size (they're pretty-printed, among other things). Let's see if we can make them smaller. Here's what happens when you simply strip the whitespace:
$ python3 -c 'import json, sys;json.dump(json.load(sys.stdin), sys.stdout)' < Ranalysis3.v.cache > Ranalysis3.v.cache.min
$ python3 -c 'import json, sys;json.dump(json.load(sys.stdin), sys.stdout)' < List.v.cache > List.v.cache.min
$ ls -lh *.cache* *.html
-rw-rw-r-- 1 clement clement 4.0M Jul 8 23:09 List.html
-rw-rw-r-- 1 clement clement 3.2M Jul 8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement 1.5M Jul 8 23:25 List.v.cache.min
-rw-rw-r-- 1 clement clement 25M Jul 8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement 21M Jul 8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 9.5M Jul 8 23:25 Ranalysis3.v.cache.min
That's a bit more than 50% off. Minifying further, using a less verbose format, shaves off roughly another 35% (a sketch of the idea follows this listing):
$ ls -lh *.cache* *.html
-rw-rw-r-- 1 clement clement 4.0M Jul 8 23:09 List.html
-rw-rw-r-- 1 clement clement 3.2M Jul 8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement 888K Jul 8 23:32 List.v.cache.min2
-rw-rw-r-- 1 clement clement 25M Jul 8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement 21M Jul 8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 6.3M Jul 8 23:32 Ranalysis3.v.cache.min2
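For concreteness, here's a minimal sketch of what a "less verbose format" could mean; the key names are hypothetical (not Alectryon's actual cache schema), the point is just shortening verbose JSON keys to one-character tags and dropping whitespace:

```python
import json
import sys

# Hypothetical mapping from verbose cache keys to one-character tags;
# the real cache schema may use different names.
KEY_MAP = {"contents": "c", "goals": "g", "hypotheses": "h",
           "conclusion": "k", "messages": "m", "names": "n"}

def shorten(node):
    """Recursively rename dictionary keys using KEY_MAP."""
    if isinstance(node, dict):
        return {KEY_MAP.get(k, k): shorten(v) for k, v in node.items()}
    if isinstance(node, list):
        return [shorten(v) for v in node]
    return node

# Read a pretty-printed cache on stdin, write the compact form on stdout.
json.dump(shorten(json.load(sys.stdin)), sys.stdout, separators=(",", ":"))
```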
And finally, with a simple sed script to gauge further savings, I get an extra 25% or so (a rough stand-in for the script appears after the listing):
$ ls -lh *.cache* *.html
-rw-rw-r-- 1 clement clement 4.0M Jul 8 23:09 List.html
-rw-rw-r-- 1 clement clement 3.2M Jul 8 23:09 List.v.cache
-rw-rw-r-- 1 clement clement 522K Jul 8 23:43 List.v.cache.xs
-rw-rw-r-- 1 clement clement 25M Jul 8 23:19 Ranalysis3.html
-rw-rw-r-- 1 clement clement 21M Jul 8 23:18 Ranalysis3.v.cache
-rw-rw-r-- 1 clement clement 4.8M Jul 8 23:44 Ranalysis3.v.cache.xs
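The sed script itself isn't reproduced here; as a rough, hedged stand-in for the idea, this Python snippet replaces the most frequent fixed-size substrings with short tokens, purely to estimate achievable savings (it is not a reversible encoding):

```python
import collections
import sys

text = sys.stdin.read()
# Count 16-byte chunks; purely a measurement device for redundancy.
chunks = collections.Counter(text[i:i + 16]
                             for i in range(0, len(text) - 16, 16))
for rank, (chunk, count) in enumerate(chunks.most_common(64)):
    if count > 2:
        text = text.replace(chunk, "\x00%d\x00" % rank)
sys.stdout.write(text)  # smaller output ~= upper bound on sed-style savings
```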
So, if uncompressed size is a concern, it would be reasonable to invest in making smaller movie files. I think we could save about 80% versus uncompressed HTML: we would generate webpages that have only placeholders for the <pre> blocks, and we'd populate them dynamically from a downloaded JSON file. I could write a simple minifier, but I won't have time to write the webpage loader soon (it should be relatively easy, though). With additional cleverness we might save space currently lost on repeated strings, like when the same subgoal is repeated ten times, but I'm not sure the effort is worth it.
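To make the placeholder idea concrete, here's a rough sketch (file names and attribute names are mine, not a committed design): split a rendered page into an HTML skeleton with empty, id-tagged <pre> placeholders, plus a JSON side file holding the actual contents:

```python
import json
import re
import sys

html = sys.stdin.read()
blocks = {}

def stash(match):
    """Replace each <pre> body with an empty, id-tagged placeholder.
    For simplicity this sketch drops the original <pre> attributes."""
    key = "pre-{}".format(len(blocks))
    blocks[key] = match.group(1)
    return '<pre data-placeholder="{}"></pre>'.format(key)

skeleton = re.sub(r"<pre[^>]*>(.*?)</pre>", stash, html, flags=re.S)
with open("page.skeleton.html", "w") as f:
    f.write(skeleton)
with open("page.blocks.json", "w") as f:
    json.dump(blocks, f, separators=(",", ":"))
# A small JS loader would then fetch page.blocks.json on page load and
# fill each placeholder back in.
```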
I've done many more experiments with this. Adding a simple form of deduplication (replacing repeated goals and hypotheses with backreferences) shrinks caches quite a bit; a sketch of the idea follows the numbers:
3.2M List.v.cache
443K List.v.cache.dedup
21M Ranalysis3.v.cache
352K Ranalysis3.v.cache.dedup
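As an illustration of the backreference idea (the actual implementation may differ), here's a minimal Python sketch that interns every string into a shared table, so each repeated goal or hypothesis is stored once and repeats become small integer indices. A real format would also need to tag backreferences so they can't be confused with ordinary numbers:

```python
import json
import sys

strings, index = [], {}

def intern(node):
    """Replace each string leaf with an index into a shared string table."""
    if isinstance(node, str):
        if node not in index:
            index[node] = len(strings)
            strings.append(node)
        return index[node]
    if isinstance(node, dict):
        return {k: intern(v) for k, v in node.items()}
    if isinstance(node, list):
        return [intern(v) for v in node]
    return node

data = intern(json.load(sys.stdin))
json.dump({"strings": strings, "data": data}, sys.stdout,
          separators=(",", ":"))
```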
I've implemented something similar for webpages directly, and it works pretty well, too:
4.4M List.html
1.8M List.min.html
25M Ranalysis3.html
1.5M Ranalysis3.min.html
(It also makes pages load a lot faster: about 60% faster in the Ranalysis3 case.)
Finally, I've added an option for cache compression. One thing that's really striking is that XZ and Brotli beat gzip by far on these examples (a reproduction snippet follows the numbers):
21M Jul 8 23:18 Ranalysis3.v.cache
278K Jul 8 23:18 Ranalysis3.v.cache.gz
19K Jul 8 23:18 Ranalysis3.v.cache.br
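If you want to reproduce the comparison yourself, a small script along these lines works; gzip and lzma (XZ) are in Python's standard library, and brotli is a third-party package:

```python
import gzip
import lzma
import sys

try:
    import brotli  # pip install brotli
except ImportError:
    brotli = None

data = open(sys.argv[1], "rb").read()
print("raw   ", len(data))
print("gzip  ", len(gzip.compress(data, compresslevel=9)))
print("xz    ", len(lzma.compress(data, preset=9)))
if brotli:
    print("brotli", len(brotli.compress(data, quality=11)))
```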
Can you try the implementation that's in the wip branch? It should be enough to pass --cache-directory ., --html-minification, and --cache-compression=xz.
Sorry, I've been busy lately. I will try to test ASAP (hopefully by the end of the week).
OK, I've merged the changes; you can use --html-minification. Please try it and we'll refine the feature directly on master.
Very good! With --html-minification my files have really shrunk; the largest one is now only 3M. Here is the size of each old version as a percentage of the size of the new one (a small script after the list shows how these ratios are computed):
triangular.html 133.00 %
phi.html 147.00 %
extra.html 220.00 %
psi.html 391.00 %
ghanoi3.html 443.00 %
gdist.html 458.00 %
shanoi.html 532.00 %
ghanoi4.html 565.00 %
ghanoi.html 638.00 %
star.html 728.00 %
rhanoi3.html 1616.00 %
lhanoi3.html 2627.00 %
shanoi4.html 7572.00 %
rhanoi4.html 8158.00 %
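For reference, these ratios are old size divided by new size, times 100; e.g. shanoi4.html went from 228M to about 3M, hence ~7572%. A script along these lines (directory names hypothetical) reproduces the list:

```python
import os

# Hypothetical layout: old/ and new/ hold the pages before and after
# --html-minification.
for name in sorted(os.listdir("old")):
    old = os.path.getsize(os.path.join("old", name))
    new = os.path.getsize(os.path.join("new", name))
    print("{} {:.2f} %".format(name, 100.0 * old / new))
```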
:tada:
Woohoo! :) Thanks a lot for testing.