ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

Keep and upload database of finished jobs #465

Open JustAnotherArchivist opened 3 years ago

JustAnotherArchivist commented 3 years ago

When a job crashes or is aborted, its DB and therefore its queue is simply deleted while the other things (WARC so far, JSON, and since #396 the log file) are kept. I think we should also retain the database. It may even be worth considering keeping it for all jobs.

Besides preserving the remaining queue for crashed and aborted jobs, it also allows for easier access to the crawl information. For example, it's much easier to extract all URLs that failed three times or that resulted in a particular status code from the DB than to painfully process the log file. It could also allow for running 'update crawls' (outside of ArchiveBot) at a later time by reusing the DB of a job to skip (some) URLs that were already retrieved, without having to construct such a DB from the log file.
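To illustrate the kind of query a retained DB would make trivial, here is a minimal sketch using Python's standard `sqlite3` module. The table and column names (`urls`, `url`, `status`, `try_count`, `status_code`) and the status value `'error'` are hypothetical stand-ins chosen for the example; wpull's real schema is laid out differently, so the SQL would need adapting.

```python
import sqlite3


def failed_urls(conn, min_tries=3):
    """URLs that were tried at least min_tries times and still ended in an error."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE status = 'error' AND try_count >= ? ORDER BY url",
        (min_tries,),
    )
    return [row[0] for row in rows]


def urls_with_status_code(conn, code):
    """URLs whose last response had the given HTTP status code."""
    rows = conn.execute(
        "SELECT url FROM urls WHERE status_code = ? ORDER BY url", (code,)
    )
    return [row[0] for row in rows]


def make_demo_db():
    """Build a tiny in-memory DB with the hypothetical schema described above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls (url TEXT PRIMARY KEY, status TEXT, "
        "try_count INTEGER, status_code INTEGER)"
    )
    conn.executemany(
        "INSERT INTO urls VALUES (?, ?, ?, ?)",
        [
            ("https://example.org/", "done", 1, 200),
            ("https://example.org/a", "error", 3, 500),
            ("https://example.org/b", "error", 1, 404),
            ("https://example.org/c", "done", 2, 404),
        ],
    )
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    print(failed_urls(demo))
    print(urls_with_status_code(demo, 404))
```

Doing the same from the log file would mean parsing every line and tracking retries per URL by hand.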

The obvious downside is the data/storage size. However, in the grand scheme of things, this doesn't make a big difference. As a point of reference, job 6recrrotn072khaaje73k60kh, one of the largest jobs currently running at 65 million URLs, has a DB file of 15.8 GiB. This is pretty much insignificant compared to the job's data size of 4.8 TiB, especially as compression decreases the size further by a factor of 4-5 (zstd without tuning: 3.57 GiB or 22.6 %). So this is an increase in data per job on the order of 1 ‰ (except in the rare extreme cases where the vast majority of URLs are ignored).
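The "order of 1 ‰" estimate can be checked with a few lines of arithmetic. This sketch only uses the figures quoted above (15.8 GiB raw, 3.57 GiB with zstd, 4.8 TiB of job data); the function name is made up for the example. It comes out at roughly 3.2 ‰ for the uncompressed DB and 0.7 ‰ for the compressed one.

```python
GIB_PER_TIB = 1024


def permille_of_job(extra_gib, job_data_tib):
    """Size of an extra file, in parts per thousand of the job's data size."""
    return extra_gib / (job_data_tib * GIB_PER_TIB) * 1000


# Figures quoted in the comment above.
raw_db = permille_of_job(15.8, 4.8)
zstd_db = permille_of_job(3.57, 4.8)
compression_ratio = 3.57 / 15.8

if __name__ == "__main__":
    print(f"raw DB:        {raw_db:.2f} permille of the job data")
    print(f"zstd DB:       {zstd_db:.2f} permille of the job data")
    print(f"zstd/raw size: {compression_ratio:.1%}")
```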

Arkiver2 commented 3 years ago

This is a great idea, I think size is not a problem here.

I'm not sure what exactly is stored in the DB. Any sensitive information? If not, we should definitely preserve the DB (also on finished jobs).

JustAnotherArchivist commented 3 years ago

Nothing sensitive at all. It contains only information on the crawl itself that could in theory be regenerated using the source code and the WARCs (but you might turn suicidal trying to do that): URLs, their relations (parent, root), recursion info (level, inline level), crawl info (status, try count, priority [currently unused]), and some info on the content ('link type', status code). POST data and local filenames would also end up there but are not used by AB. Sometime in the future, cookies will also be there, but again, nothing that couldn't be reconstructed from the WARCs anyway (and the IRC commands if we add manual cookie control).

Arkiver2 commented 3 years ago

Reconstructing this would be pretty difficult (and would mean processing a ton of data). Size is not a problem (relative to total WARC size). Since there's nothing sensitive in the DB, let's do it.

We could gzip it and upload together with the JSON and WARCs.

JustAnotherArchivist commented 3 years ago

Yep. It's possible in theory but completely unfeasible in practice.

I'll play around with gzip vs zstd a bit. It'll be a .db.gz or .db.zst file with the same filename structure as everything else.

JustAnotherArchivist commented 3 years ago

I ran a few tests on large-ish databases on a busy pipeline in a terminal:

| Job | Original size | gzip -6 size | gzip -6 time | gzip -9 size | gzip -9 time | zstd size | zstd time |
|---|---|---|---|---|---|---|---|
| 1m71j820n4qka3ob7w6dlja3y | 23.7 GiB | 3.93 GiB | 17 min | | | 3.58 GiB | 2.5 min |
| 9hdfwijhzx86os1k3tm1wgq3i | 1034 MiB | 241 MiB | 33 s | 239 MiB | 50 s | 221 MiB | 4.3 s |
| 2g2xqrj2na5od7mk5mql0q3bn | 326 MiB | 67.3 MiB | 10 s | 66.7 MiB | 23 s | 64.7 MiB | 2.3 s |
| 73pjjo1i8uyububkhbpaf6ndr | 5.00 GiB | 0.915 GiB | 2 min 20 s | | | 0.890 GiB | 30 s |

The implications are pretty obvious.

I'll probably switch the log compression (on crashed/aborted jobs) to zstd as well. Although zstd actually produces a larger file than even gzip -6 with the default settings in a test, it only takes a slight increase of the compression level to fix that. zstd -10 takes about the same time as gzip -9 on my partial test log from job 9hdfwijhzx86os1k3tm1wgq3i (1010 MiB) at 33 s but produces a file of 85 MiB compared to gzip's 99 MiB. I'll do some more testing to find the sweet spot there.

JustAnotherArchivist commented 3 years ago

I looked into this a bit again. I took the DB from 5nbpflkse0rs1tlgch8n4efud (2.94 GB, 13 million URLs, runtime before crashing about a week) and the partial log file from 3pwf0useacbmua9uwp4idpale (3.64 GB, 12 million URLs, runtime about a month so far) and compressed them at most levels of zstd and gzip. I ran this on a fairly busy AB pipeline (jap-kakapo), so it should be representative of what the runtime might look like in reality. The jobs are obviously among the larger ones running through AB. My analysis consisted of staring at shitty graphs of user time vs compression ratio in LibreOffice Calc.

Test results

## Database

### zstd

| Compression level | Original size (bytes) | Compressed size (bytes) | Compression ratio | Real time (s) | User time (s) | Sys time (s) |
|---|---|---|---|---|---|---|
| 1 | 2944745472 | 721835831 | 24.51% | 13.616 | 11.565 | 1.751 |
| 2 | 2944745472 | 700663107 | 23.79% | 13.192 | 13.300 | 1.097 |
| 3 | 2944745472 | 682199494 | 23.17% | 16.743 | 16.700 | 1.231 |
| 4 | 2944745472 | 677610833 | 23.01% | 21.281 | 21.426 | 1.187 |
| 5 | 2944745472 | 661601839 | 22.47% | 50.899 | 50.691 | 1.414 |
| 6 | 2944745472 | 657653273 | 22.33% | 56.692 | 56.470 | 1.247 |
| 7 | 2944745472 | 630368182 | 21.41% | 68.079 | 67.727 | 1.542 |
| 8 | 2944745472 | 625318048 | 21.24% | 79.158 | 79.157 | 1.252 |
| 9 | 2944745472 | 622723235 | 21.15% | 93.913 | 93.947 | 1.144 |
| 10 | 2944745472 | 613131472 | 20.82% | 117.855 | 117.652 | 1.381 |
| 11 | 2944745472 | 610937389 | 20.75% | 131.157 | 130.767 | 1.344 |
| 12 | 2944745472 | 609634199 | 20.70% | 176.475 | 176.017 | 1.516 |
| 13 | 2944745472 | 609777705 | 20.71% | 201.050 | 196.477 | 2.412 |
| 14 | 2944745472 | 607311093 | 20.62% | 218.251 | 215.267 | 2.175 |
| 15 | 2944745472 | 605756166 | 20.57% | 265.313 | 262.540 | 2.719 |
| 16 | 2944745472 | 588934765 | 20.00% | 572.187 | 561.204 | 3.635 |
| 17 | 2944745472 | 562606051 | 19.11% | 697.623 | 690.251 | 4.968 |
| 18 | 2944745472 | 538896215 | 18.30% | 1085.334 | 1077.788 | 5.231 |
| 19 | 2944745472 | 530637003 | 18.02% | 1519.945 | 1512.603 | 4.898 |

### gzip

| Compression level | Original size (bytes) | Compressed size (bytes) | Compression ratio | Real time (s) | User time (s) | Sys time (s) |
|---|---|---|---|---|---|---|
| 1 | 2944745472 | 806176600 | 27.38% | 47.730 | 42.625 | 1.891 |
| 2 | 2944745472 | 800534883 | 27.19% | 47.969 | 44.378 | 1.551 |
| 3 | 2944745472 | 770088717 | 26.15% | 56.418 | 54.249 | 1.833 |
| 4 | 2944745472 | 736143418 | 25.00% | 65.347 | 62.334 | 1.711 |
| 5 | 2944745472 | 723571018 | 24.57% | 71.107 | 68.941 | 1.759 |
| 6 | 2944745472 | 717027291 | 24.35% | 89.407 | 87.594 | 1.560 |
| 7 | 2944745472 | 713746787 | 24.24% | 103.271 | 100.502 | 1.680 |
| 8 | 2944745472 | 711333243 | 24.16% | 126.486 | 124.023 | 1.536 |
| 9 | 2944745472 | 711214985 | 24.15% | 138.508 | 134.626 | 1.927 |

## Log

### zstd

(Only ran it up to level 15 because it was getting ridiculous...)

| Compression level | Original size (bytes) | Compressed size (bytes) | Compression ratio | Real time (s) | User time (s) | Sys time (s) |
|---|---|---|---|---|---|---|
| 1 | 3641670876 | 440404842 | 12.09% | 11.606 | 11.189 | 1.098 |
| 2 | 3641670876 | 435763309 | 11.97% | 12.000 | 11.859 | 1.232 |
| 3 | 3641670876 | 432647510 | 11.88% | 15.586 | 15.240 | 1.290 |
| 4 | 3641670876 | 433149771 | 11.89% | 18.272 | 17.941 | 1.072 |
| 5 | 3641670876 | 402242867 | 11.05% | 39.903 | 39.730 | 1.240 |
| 6 | 3641670876 | 395880291 | 10.87% | 43.403 | 43.543 | 1.198 |
| 7 | 3641670876 | 379345921 | 10.42% | 58.751 | 58.411 | 1.505 |
| 8 | 3641670876 | 369124857 | 10.14% | 72.449 | 71.646 | 1.644 |
| 9 | 3641670876 | 367090066 | 10.08% | 87.926 | 87.384 | 1.644 |
| 10 | 3641670876 | 365891660 | 10.05% | 103.167 | 103.085 | 1.317 |
| 11 | 3641670876 | 365068174 | 10.02% | 124.915 | 124.942 | 1.265 |
| 12 | 3641670876 | 363906198 | 9.99% | 164.777 | 163.296 | 1.347 |
| 13 | 3641670876 | 359998040 | 9.89% | 228.925 | 228.116 | 2.024 |
| 14 | 3641670876 | 358985335 | 9.86% | 267.016 | 265.797 | 2.248 |
| 15 | 3641670876 | 358212227 | 9.84% | 334.854 | 333.494 | 2.032 |

### gzip

| Compression level | Original size (bytes) | Compressed size (bytes) | Compression ratio | Real time (s) | User time (s) | Sys time (s) |
|---|---|---|---|---|---|---|
| 1 | 3641670876 | 506536391 | 13.91% | 36.188 | 33.271 | 1.340 |
| 2 | 3641670876 | 493880878 | 13.56% | 32.974 | 31.712 | 1.168 |
| 3 | 3641670876 | 483203714 | 13.27% | 36.171 | 33.592 | 1.383 |
| 4 | 3641670876 | 452770760 | 12.43% | 48.991 | 45.913 | 1.296 |
| 5 | 3641670876 | 436844810 | 12.00% | 47.157 | 45.902 | 1.175 |
| 6 | 3641670876 | 418611901 | 11.50% | 63.652 | 60.297 | 1.332 |
| 7 | 3641670876 | 416090448 | 11.43% | 70.472 | 68.945 | 1.268 |
| 8 | 3641670876 | 400631037 | 11.00% | 88.818 | 87.520 | 1.128 |
| 9 | 3641670876 | 400425421 | 11.00% | 114.291 | 112.666 | 1.244 |

(Raw terminal output in case I screwed up the tabulation somewhere)

```
> for lvl in {1..22}; do echo $lvl; time zstd -$lvl patriots.win-inf-20210123-012541-5nbpf-wpull.db -o patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst$lvl; echo; echo; done
1
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 24.51% (2944745472 => 721835831 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst1)
real 0m13.616s
user 0m11.565s
sys 0m1.751s

2
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 23.79% (2944745472 => 700663107 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst2)
real 0m13.192s
user 0m13.300s
sys 0m1.097s

3
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 23.17% (2944745472 => 682199494 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst3)
real 0m16.743s
user 0m16.700s
sys 0m1.231s

4
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 23.01% (2944745472 => 677610833 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst4)
real 0m21.281s
user 0m21.426s
sys 0m1.187s

5
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 22.47% (2944745472 => 661601839 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst5)
real 0m50.899s
user 0m50.691s
sys 0m1.414s

6
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 22.33% (2944745472 => 657653273 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst6)
real 0m56.692s
user 0m56.470s
sys 0m1.247s

7
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 21.41% (2944745472 => 630368182 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst7)
real 1m8.079s
user 1m7.727s
sys 0m1.542s

8
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 21.24% (2944745472 => 625318048 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst8)
real 1m19.158s
user 1m19.157s
sys 0m1.252s

9
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 21.15% (2944745472 => 622723235 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst9)
real 1m33.913s
user 1m33.947s
sys 0m1.144s

10
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.82% (2944745472 => 613131472 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst10)
real 1m57.855s
user 1m57.652s
sys 0m1.381s

11
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.75% (2944745472 => 610937389 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst11)
real 2m11.157s
user 2m10.767s
sys 0m1.344s

12
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.70% (2944745472 => 609634199 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst12)
real 2m56.475s
user 2m56.017s
sys 0m1.516s

13
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.71% (2944745472 => 609777705 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst13)
real 3m21.050s
user 3m16.477s
sys 0m2.412s

14
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.62% (2944745472 => 607311093 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst14)
real 3m38.251s
user 3m35.267s
sys 0m2.175s

15
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.57% (2944745472 => 605756166 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst15)
real 4m25.313s
user 4m22.540s
sys 0m2.719s

16
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 20.00% (2944745472 => 588934765 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst16)
real 9m32.187s
user 9m21.204s
sys 0m3.635s

17
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 19.11% (2944745472 => 562606051 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst17)
real 11m37.623s
user 11m30.251s
sys 0m4.968s

18
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 18.30% (2944745472 => 538896215 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst18)
real 18m5.334s
user 17m57.788s
sys 0m5.231s

19
patriots.win-inf-20210123-012541-5nbpf-wpull.db : 18.02% (2944745472 => 530637003 bytes, patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst19)
real 25m19.945s
user 25m12.603s
sys 0m4.898s

> for lvl in {1..9}; do echo $lvl; time gzip -$lvl <patriots.win-inf-20210123-012541-5nbpf-wpull.db >patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz$lvl; echo; echo; done
1
real 0m47.730s
user 0m42.625s
sys 0m1.891s

2
real 0m47.969s
user 0m44.378s
sys 0m1.551s

3
real 0m56.418s
user 0m54.249s
sys 0m1.833s

4
real 1m5.347s
user 1m2.334s
sys 0m1.711s

5
real 1m11.107s
user 1m8.941s
sys 0m1.759s

6
real 1m29.407s
user 1m27.594s
sys 0m1.560s

7
real 1m43.271s
user 1m40.502s
sys 0m1.680s

8
real 2m6.486s
user 2m4.023s
sys 0m1.536s

9
real 2m18.508s
user 2m14.626s
sys 0m1.927s

> for lvl in {1..19}; do echo $lvl; time zstd -$lvl 3pwf0useacbmua9uwp4idpale.log -o 3pwf0useacbmua9uwp4idpale.log.zst$lvl; echo; echo; done
1
3pwf0useacbmua9uwp4idpale.log : 12.09% (3641670876 => 440404842 bytes, 3pwf0useacbmua9uwp4idpale.log.zst1)
real 0m11.606s
user 0m11.189s
sys 0m1.098s

2
3pwf0useacbmua9uwp4idpale.log : 11.97% (3641670876 => 435763309 bytes, 3pwf0useacbmua9uwp4idpale.log.zst2)
real 0m12.000s
user 0m11.859s
sys 0m1.232s

3
3pwf0useacbmua9uwp4idpale.log : 11.88% (3641670876 => 432647510 bytes, 3pwf0useacbmua9uwp4idpale.log.zst3)
real 0m15.586s
user 0m15.240s
sys 0m1.290s

4
3pwf0useacbmua9uwp4idpale.log : 11.89% (3641670876 => 433149771 bytes, 3pwf0useacbmua9uwp4idpale.log.zst4)
real 0m18.272s
user 0m17.941s
sys 0m1.072s

5
3pwf0useacbmua9uwp4idpale.log : 11.05% (3641670876 => 402242867 bytes, 3pwf0useacbmua9uwp4idpale.log.zst5)
real 0m39.903s
user 0m39.730s
sys 0m1.240s

6
3pwf0useacbmua9uwp4idpale.log : 10.87% (3641670876 => 395880291 bytes, 3pwf0useacbmua9uwp4idpale.log.zst6)
real 0m43.403s
user 0m43.543s
sys 0m1.198s

7
3pwf0useacbmua9uwp4idpale.log : 10.42% (3641670876 => 379345921 bytes, 3pwf0useacbmua9uwp4idpale.log.zst7)
real 0m58.751s
user 0m58.411s
sys 0m1.505s

8
3pwf0useacbmua9uwp4idpale.log : 10.14% (3641670876 => 369124857 bytes, 3pwf0useacbmua9uwp4idpale.log.zst8)
real 1m12.449s
user 1m11.646s
sys 0m1.644s

9
3pwf0useacbmua9uwp4idpale.log : 10.08% (3641670876 => 367090066 bytes, 3pwf0useacbmua9uwp4idpale.log.zst9)
real 1m27.926s
user 1m27.384s
sys 0m1.644s

10
3pwf0useacbmua9uwp4idpale.log : 10.05% (3641670876 => 365891660 bytes, 3pwf0useacbmua9uwp4idpale.log.zst10)
real 1m43.167s
user 1m43.085s
sys 0m1.317s

11
3pwf0useacbmua9uwp4idpale.log : 10.02% (3641670876 => 365068174 bytes, 3pwf0useacbmua9uwp4idpale.log.zst11)
real 2m4.915s
user 2m4.942s
sys 0m1.265s

12
3pwf0useacbmua9uwp4idpale.log : 9.99% (3641670876 => 363906198 bytes, 3pwf0useacbmua9uwp4idpale.log.zst12)
real 2m44.777s
user 2m43.296s
sys 0m1.347s

13
3pwf0useacbmua9uwp4idpale.log : 9.89% (3641670876 => 359998040 bytes, 3pwf0useacbmua9uwp4idpale.log.zst13)
real 3m48.925s
user 3m48.116s
sys 0m2.024s

14
3pwf0useacbmua9uwp4idpale.log : 9.86% (3641670876 => 358985335 bytes, 3pwf0useacbmua9uwp4idpale.log.zst14)
real 4m27.016s
user 4m25.797s
sys 0m2.248s

15
3pwf0useacbmua9uwp4idpale.log : 9.84% (3641670876 => 358212227 bytes, 3pwf0useacbmua9uwp4idpale.log.zst15)
real 5m34.854s
user 5m33.494s
sys 0m2.032s

> for lvl in {1..9}; do echo $lvl; time gzip -$lvl <3pwf0useacbmua9uwp4idpale.log >3pwf0useacbmua9uwp4idpale.log.gz$lvl; echo; echo; done
1
real 0m36.188s
user 0m33.271s
sys 0m1.340s

2
real 0m32.974s
user 0m31.712s
sys 0m1.168s

3
real 0m36.171s
user 0m33.592s
sys 0m1.383s

4
real 0m48.991s
user 0m45.913s
sys 0m1.296s

5
real 0m47.157s
user 0m45.902s
sys 0m1.175s

6
real 1m3.652s
user 1m0.297s
sys 0m1.332s

7
real 1m10.472s
user 1m8.945s
sys 0m1.268s

8
real 1m28.818s
user 1m27.520s
sys 0m1.128s

9
real 1m54.291s
user 1m52.666s
sys 0m1.244s

> ll
total 34151264
drwxr-xr-x 2 archivebot archivebot 4096 Feb 21 03:42 .
drwxr-xr-x 20 archivebot archivebot 4096 Feb 21 02:55 ..
-rw-r--r-- 1 archivebot archivebot 3641670876 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log
-rw-r--r-- 1 archivebot archivebot 506536391 Feb 21 03:33 3pwf0useacbmua9uwp4idpale.log.gz1
-rw-r--r-- 1 archivebot archivebot 493880878 Feb 21 03:33 3pwf0useacbmua9uwp4idpale.log.gz2
-rw-r--r-- 1 archivebot archivebot 483203714 Feb 21 03:34 3pwf0useacbmua9uwp4idpale.log.gz3
-rw-r--r-- 1 archivebot archivebot 452770760 Feb 21 03:35 3pwf0useacbmua9uwp4idpale.log.gz4
-rw-r--r-- 1 archivebot archivebot 436844810 Feb 21 03:35 3pwf0useacbmua9uwp4idpale.log.gz5
-rw-r--r-- 1 archivebot archivebot 418611901 Feb 21 03:36 3pwf0useacbmua9uwp4idpale.log.gz6
-rw-r--r-- 1 archivebot archivebot 416090448 Feb 21 03:38 3pwf0useacbmua9uwp4idpale.log.gz7
-rw-r--r-- 1 archivebot archivebot 400631037 Feb 21 03:39 3pwf0useacbmua9uwp4idpale.log.gz8
-rw-r--r-- 1 archivebot archivebot 400425421 Feb 21 03:41 3pwf0useacbmua9uwp4idpale.log.gz9
-rw-r--r-- 1 archivebot archivebot 440404842 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst1
-rw-r--r-- 1 archivebot archivebot 365891660 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst10
-rw-r--r-- 1 archivebot archivebot 365068174 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst11
-rw-r--r-- 1 archivebot archivebot 363906198 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst12
-rw-r--r-- 1 archivebot archivebot 359998040 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst13
-rw-r--r-- 1 archivebot archivebot 358985335 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst14
-rw-r--r-- 1 archivebot archivebot 358212227 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst15
-rw-r--r-- 1 archivebot archivebot 435763309 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst2
-rw-r--r-- 1 archivebot archivebot 432647510 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst3
-rw-r--r-- 1 archivebot archivebot 433149771 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst4
-rw-r--r-- 1 archivebot archivebot 402242867 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst5
-rw-r--r-- 1 archivebot archivebot 395880291 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst6
-rw-r--r-- 1 archivebot archivebot 379345921 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst7
-rw-r--r-- 1 archivebot archivebot 369124857 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst8
-rw-r--r-- 1 archivebot archivebot 367090066 Feb 21 02:56 3pwf0useacbmua9uwp4idpale.log.zst9
-rw-r--r-- 2 archivebot archivebot 2944745472 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db
-rw-r--r-- 1 archivebot archivebot 806176600 Feb 21 02:17 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz1
-rw-r--r-- 1 archivebot archivebot 800534883 Feb 21 02:18 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz2
-rw-r--r-- 1 archivebot archivebot 770088717 Feb 21 02:19 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz3
-rw-r--r-- 1 archivebot archivebot 736143418 Feb 21 02:20 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz4
-rw-r--r-- 1 archivebot archivebot 723571018 Feb 21 02:21 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz5
-rw-r--r-- 1 archivebot archivebot 717027291 Feb 21 02:22 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz6
-rw-r--r-- 1 archivebot archivebot 713746787 Feb 21 02:24 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz7
-rw-r--r-- 1 archivebot archivebot 711333243 Feb 21 02:26 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz8
-rw-r--r-- 1 archivebot archivebot 711214985 Feb 21 02:28 patriots.win-inf-20210123-012541-5nbpf-wpull.db.gz9
-rw-r--r-- 1 archivebot archivebot 721835831 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst1
-rw-r--r-- 1 archivebot archivebot 613131472 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst10
-rw-r--r-- 1 archivebot archivebot 610937389 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst11
-rw-r--r-- 1 archivebot archivebot 609634199 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst12
-rw-r--r-- 1 archivebot archivebot 609777705 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst13
-rw-r--r-- 1 archivebot archivebot 607311093 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst14
-rw-r--r-- 1 archivebot archivebot 605756166 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst15
-rw-r--r-- 1 archivebot archivebot 588934765 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst16
-rw-r--r-- 1 archivebot archivebot 562606051 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst17
-rw-r--r-- 1 archivebot archivebot 538896215 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst18
-rw-r--r-- 1 archivebot archivebot 530637003 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst19
-rw-r--r-- 1 archivebot archivebot 700663107 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst2
-rw-r--r-- 1 archivebot archivebot 682199494 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst3
-rw-r--r-- 1 archivebot archivebot 677610833 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst4
-rw-r--r-- 1 archivebot archivebot 661601839 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst5
-rw-r--r-- 1 archivebot archivebot 657653273 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst6
-rw-r--r-- 1 archivebot archivebot 630368182 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst7
-rw-r--r-- 1 archivebot archivebot 625318048 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst8
-rw-r--r-- 1 archivebot archivebot 622723235 Jan 30 09:39 patriots.win-inf-20210123-012541-5nbpf-wpull.db.zst9
```

My conclusion: the sweet spot with zstd seems to be 10 for databases and 8 for logs. Up to that, there is an acceptable increase in runtime with significant space savings. Beyond that, the large increase in compression time outweighs the relatively small size reduction. Unless someone yells at me, that's what I'll implement soon™.
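For anyone who wants to redo this without a spreadsheet, here is a minimal sketch of one way to put numbers on the trade-off: for each zstd level, how many MB are saved per extra second of CPU time compared with the previous level. The metric and the function name are assumptions made for illustration, not necessarily how the graphs above were read; the data is the database zstd table from this thread (levels 1 to 16).

```python
# (level, compressed size in bytes, user time in seconds), copied from the
# zstd results for the database in this thread.
ZSTD_DB = [
    (1, 721835831, 11.565),
    (2, 700663107, 13.300),
    (3, 682199494, 16.700),
    (4, 677610833, 21.426),
    (5, 661601839, 50.691),
    (6, 657653273, 56.470),
    (7, 630368182, 67.727),
    (8, 625318048, 79.157),
    (9, 622723235, 93.947),
    (10, 613131472, 117.652),
    (11, 610937389, 130.767),
    (12, 609634199, 176.017),
    (13, 609777705, 196.477),
    (14, 607311093, 215.267),
    (15, 605756166, 262.540),
    (16, 588934765, 561.204),
]


def marginal_gains(rows):
    """Map each level (after the first) to MB saved per extra CPU second
    relative to the previous level. Negative means the file got bigger."""
    gains = {}
    for (_, prev_size, prev_time), (level, size, time) in zip(rows, rows[1:]):
        gains[level] = (prev_size - size) / 1e6 / (time - prev_time)
    return gains


if __name__ == "__main__":
    for level, gain in marginal_gains(ZSTD_DB).items():
        print(f"zstd -{level:<2}: {gain:8.3f} MB saved per extra CPU second")
```

The same function works for the log table by swapping in its rows. Where exactly to draw the line remains a judgment call, since the per-step yield is not monotonic.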

Fun side note: on the database, even zstd -2 compresses better than gzip -9, and at a roughly 10 times shorter runtime! (On the log, zstd needs a somewhat higher level to beat gzip -9 on size.)

JustAnotherArchivist commented 2 years ago

A complication is SQLite's Write-Ahead Log (which records changes to the DB that aren't merged into the main database file yet). When the DB gets closed, the WAL gets merged, and only wpull.db remains (but is this guaranteed behaviour?). This is what happens on aborting, for example. But when wpull crashes, wpull.db-wal and wpull.db-shm remain. Merging explicitly is possible using sqlite3 wpull.db 'PRAGMA wal_checkpoint' (see the SQLite docs; passing an argument might be better), but I'm not sure whether that always works. Perhaps there'd need to be a fallback to preserve all three files in case the wal_checkpoint fails to merge them together.
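A minimal sketch of that checkpoint-with-fallback idea, using Python's standard `sqlite3` module. The function name and the decision rule are assumptions made for illustration: it runs `PRAGMA wal_checkpoint(TRUNCATE)`, treats the merge as successful only if SQLite reports it was not blocked and the `-wal` file is gone or empty afterwards, and otherwise says to keep all files that exist.

```python
import os
import sqlite3


def files_to_preserve(db_path):
    """Try to merge the WAL into db_path and return the files worth keeping.

    Returns [db_path] if the checkpoint completed, otherwise db_path plus
    whichever of the -wal/-shm sidecar files are still present.
    """
    wal_path = db_path + "-wal"
    shm_path = db_path + "-shm"
    try:
        conn = sqlite3.connect(db_path)
        try:
            # The pragma returns one row: (busy, WAL frames, frames checkpointed).
            busy = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()[0]
        finally:
            conn.close()
    except sqlite3.Error:
        # Corrupt or unreadable DB: treat it as a failed merge.
        busy = 1

    wal_merged = not os.path.exists(wal_path) or os.path.getsize(wal_path) == 0
    if busy == 0 and wal_merged:
        return [db_path]
    return [db_path] + [p for p in (wal_path, shm_path) if os.path.exists(p)]
```

A caller would then compress and upload whatever list comes back, e.g. `for path in files_to_preserve("wpull.db"): ...`. Whether this covers every crash scenario is exactly the open question above, which is why the fallback branch exists.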

pabs3 commented 1 month ago

Might be worth dumping the SQLite databases to SQL and compressing that instead of compressing the raw binary database files.
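A minimal sketch of what dumping to SQL and compressing could look like, using `sqlite3.Connection.iterdump()` from the Python standard library. The function names and the `.sql.gz` output are assumptions for the example; gzip is used only because it ships with Python, whereas the thread above favours zstd. Whether the SQL text actually compresses better than the binary file would need measuring.

```python
import gzip
import sqlite3


def dump_sql_gz(db_path, out_path, level=6):
    """Write the database at db_path as gzip-compressed SQL text to out_path."""
    conn = sqlite3.connect(db_path)
    try:
        with gzip.open(out_path, "wt", encoding="utf-8", compresslevel=level) as out:
            # iterdump() yields the schema and data as SQL statements.
            for statement in conn.iterdump():
                out.write(statement + "\n")
    finally:
        conn.close()


def restore_sql_gz(dump_path):
    """Load a dump written by dump_sql_gz into a fresh in-memory database."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
        script = dump.read()
    conn = sqlite3.connect(":memory:")
    conn.executescript(script)
    return conn
```

One trade-off to keep in mind: a SQL dump has to be re-imported before it can be queried, whereas a decompressed .db file is usable as-is.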