flokli opened this issue 2 years ago
@rickynils provided some promising numbers at https://discourse.nixos.org/t/nix-casync-a-more-efficient-way-to-store-and-substitute-nix-store-paths/16539/3.
https://github.com/flokli/nix-casync/pull/35 brings configurable chunking size (and some more concurrency), so maybe we can re-run this with some slightly different chunk sizes?
@flokli I can re-run the nix-casync ingestion on the same data set I used before to see if any storage gains can be made by using different chunk sizes. It will probably take a couple of days to run, so I can't test a large number of chunk sizes. What sizes would you like me to try out?
@flokli Btw, you asked for the distribution of chunk sizes. I've compiled a csv file in this format:

```
13875,40468,/rpool/casync-test/castr/0004/0004f2c61180e83eb965b349927d6e08bba6f0b6a595118502732a52a7d52512.cacnk
36751,79176,/rpool/casync-test/castr/0004/0004203db6082841ca81798bef756a01e65220b8a2d4b62bc8b7f648603d6af7.cacnk
9943,99748,/rpool/casync-test/castr/0004/000412d9ce0935f8cf074dcf676550512548f4874e4039bf9ec8649b7a6365d2.cacnk
18596,160993,/rpool/casync-test/castr/0004/000410c681b2016db11262973734d3981bea5ecabdafbd947d9fbafccdfca73c.cacnk
7240,25549,/rpool/casync-test/castr/0004/000454ea2beeca574f6227405cabc972f266c80f13461584b797121fb172061e.cacnk
```
The first column is the compressed chunk size (in bytes), the second the uncompressed size. There might be minor rounding errors in the byte counts, since these numbers were derived from the floating-point kilobyte figures output by `zstd --list`.
In my post on Discourse I stated that the sum of the compressed chunks was 1223264 MB, but if you sum up the sizes from my csv file you actually get 1044006 MB (15% less). This is because the number in the Discourse post includes disk/fs overhead (block alignment etc.), while the numbers in the csv file are the "raw" chunk sizes.
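To illustrate the alignment part of that overhead, here's a minimal sketch (assuming a hypothetical 4 KiB allocation unit; real ZFS accounting with recordsize, metadata and per-record compression is more involved, so this only shows the rounding-up component):

```python
import math

def on_disk_size(raw_bytes: int, block: int = 4096) -> int:
    """Raw chunk size rounded up to the next filesystem block boundary."""
    return math.ceil(raw_bytes / block) * block

# A 13875-byte chunk occupies 16384 bytes at 4 KiB alignment (~18% overhead);
# summed over millions of small chunk files, this adds up.
print(on_disk_size(13875))  # 16384
```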
The complete csv file is 1.4 GB zstd-compressed. I haven't done any analysis on it, but I've pushed it to my Cachix cache named `rickynils`. You can fetch it with:

```
nix-store -r /nix/store/2pr5achd242cna6qfk086qy0ffxgsyv2-cacnk-sizes.csv.zstd
```
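If you'd rather not decompress the whole 1.4 GB file to disk first, here's a minimal sketch that streams it in Python (assuming the python-zstandard package and that the file has been copied/symlinked out of the store as `cacnk-sizes.csv.zstd`):

```python
import csv
import io

import zstandard  # pip install zstandard

def chunk_rows(path: str = "cacnk-sizes.csv.zstd"):
    """Yield (compressed_bytes, uncompressed_bytes, chunk_path) per csv row."""
    with open(path, "rb") as f:
        reader = zstandard.ZstdDecompressor().stream_reader(f)
        text = io.TextIOWrapper(reader, encoding="utf-8")
        for comp, uncomp, chunk_path in csv.reader(text):
            yield int(comp), int(uncomp), chunk_path
```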
Thanks! Let's do some small analysis before trying other chunk sizes.
Some things that'd be good to know:
If someone beats me to producing this report (ideally as a script that can easily be run against other similar experiments too), I wouldn't mind ;-)
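A rough sketch of what such a report could look like: totals, the overall compression ratio, and a power-of-two histogram of uncompressed chunk sizes (assuming the csv has been decompressed to `cacnk-sizes.csv` first, e.g. with `zstd -d`):

```python
import csv
from collections import Counter

comp_total = uncomp_total = 0
buckets = Counter()  # uncompressed-size histogram, power-of-two buckets

with open("cacnk-sizes.csv", newline="") as f:  # decompressed copy of the csv
    for comp, uncomp, _path in csv.reader(f):
        comp, uncomp = int(comp), int(uncomp)
        comp_total += comp
        uncomp_total += uncomp
        buckets[max(uncomp.bit_length(), 1)] += 1

n = sum(buckets.values())
print(f"chunks:              {n}")
print(f"compressed total:    {comp_total / 1e6:.0f} MB")
print(f"uncompressed total:  {uncomp_total / 1e6:.0f} MB")
print(f"avg uncompressed:    {uncomp_total / n:.0f} bytes/chunk")
print(f"compression ratio:   {uncomp_total / comp_total:.2f}")
for bits in sorted(buckets):
    lo, hi = 2 ** (bits - 1), 2 ** bits
    print(f"{lo:>9} - {hi:>9} bytes: {buckets[bits]}")
```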
Looking at the total sum of uncompressed vs compressed (raw bytes) chunk sizes, the compression ratio lands at 2.38. This is roughly in line with the ZFS zstd compression ratio of 2.69 on the same data set. I don't know whether the default zstd settings used by nix-casync and ZFS differ.
The chunks in the csv file were produced by nix-casync revision `25eb0e59e23fd580187cab3c8e7860d9c0044e0c`.
A friend of mine who is not very active on GitHub made a quick analysis:
We should play with casync chunk size parameters, and see how well it deduplicates.
We should also play with CPU utilization and parallelism.