flokli / nix-casync

A more efficient way to store and substitute Nix store paths

Benchmark / Performance Testing #2

Open flokli opened 2 years ago

flokli commented 2 years ago

We should play with casync chunk size parameters, and see how well it deduplicates.

We should also play with CPU utilization and parallelism.

flokli commented 2 years ago

@rickynils provided some promising numbers at https://discourse.nixos.org/t/nix-casync-a-more-efficient-way-to-store-and-substitute-nix-store-paths/16539/3.

https://github.com/flokli/nix-casync/pull/35 brings configurable chunk sizes (and some more concurrency), so maybe we can re-run this with slightly different chunk sizes?

rickynils commented 2 years ago

@flokli I can re-run the nix-casync ingestion on the same data set I used before to see if any storage gains can be made by using different chunk sizes. It will probably take a couple of days to run, so I can't test a large number of chunk sizes. What sizes would you like me to try out?

rickynils commented 2 years ago

@flokli Btw, you asked for the distribution of chunk sizes. I've compiled a csv file in this format:

13875,40468,/rpool/casync-test/castr/0004/0004f2c61180e83eb965b349927d6e08bba6f0b6a595118502732a52a7d52512.cacnk
36751,79176,/rpool/casync-test/castr/0004/0004203db6082841ca81798bef756a01e65220b8a2d4b62bc8b7f648603d6af7.cacnk
9943,99748,/rpool/casync-test/castr/0004/000412d9ce0935f8cf074dcf676550512548f4874e4039bf9ec8649b7a6365d2.cacnk
18596,160993,/rpool/casync-test/castr/0004/000410c681b2016db11262973734d3981bea5ecabdafbd947d9fbafccdfca73c.cacnk
7240,25549,/rpool/casync-test/castr/0004/000454ea2beeca574f6227405cabc972f266c80f13461584b797121fb172061e.cacnk

The first column is the compressed chunk size (in bytes), the second column the uncompressed size. There might be minor rounding errors in the byte counts, since these numbers were derived from the floating-point kilobyte figures output by zstd --list.

In my post on discourse I stated that the sum of the compressed chunks was 1223264 MB, but if you sum up the sizes from my csv file you actually get 1044006 MB (15% less). This is because the number in the discourse post includes disk/fs overhead (block alignment etc.), while the numbers in the csv file are the "raw" chunk sizes.

The complete csv file is 1.4 GB zstd-compressed. I haven't done any analysis on it, but I've pushed it to my Cachix cache named rickynils. You can fetch it by doing:

nix-store -r /nix/store/2pr5achd242cna6qfk086qy0ffxgsyv2-cacnk-sizes.csv.zstd
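
Once fetched and decompressed, the size columns can be summed directly. Here is a minimal sketch in Go that totals the two size columns and prints the overall compression ratio; the local file name cacnk-sizes.csv is an assumption, so point it at wherever the decompressed csv ends up.

package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "os"
    "strconv"
)

func main() {
    // The file name is an assumption; adjust to the decompressed csv location.
    f, err := os.Open("cacnk-sizes.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    r := csv.NewReader(f)
    r.FieldsPerRecord = 3

    var compressed, uncompressed int64
    for {
        rec, err := r.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        // Columns: compressed bytes, uncompressed bytes, chunk path.
        c, err := strconv.ParseInt(rec[0], 10, 64)
        if err != nil {
            log.Fatal(err)
        }
        u, err := strconv.ParseInt(rec[1], 10, 64)
        if err != nil {
            log.Fatal(err)
        }
        compressed += c
        uncompressed += u
    }

    fmt.Printf("compressed total:   %d bytes\n", compressed)
    fmt.Printf("uncompressed total: %d bytes\n", uncompressed)
    fmt.Printf("compression ratio:  %.2f\n", float64(uncompressed)/float64(compressed))
}
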
flokli commented 2 years ago

Thanks! Let's do a small analysis before trying other chunk sizes.

Some things that'd be good to know:

If someone beats me to producing this report (ideally as a script that can easily be run against other similar experiments too), I wouldn't mind ;-)
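
A rough sketch of such a script, in Go, assuming the three-column csv layout described above; the power-of-two binning is just one possible way to summarize the chunk size distribution, not necessarily the exact report wanted here.

package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "math/bits"
    "os"
    "sort"
    "strconv"
)

type bin struct {
    count        int64
    compressed   int64
    uncompressed int64
}

func main() {
    if len(os.Args) < 2 {
        log.Fatal("usage: chunk-report <csv file>")
    }
    f, err := os.Open(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    bins := map[int]*bin{}
    r := csv.NewReader(f)
    r.FieldsPerRecord = 3
    for {
        rec, err := r.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        c, err := strconv.ParseInt(rec[0], 10, 64)
        if err != nil {
            log.Fatal(err)
        }
        u, err := strconv.ParseInt(rec[1], 10, 64)
        if err != nil {
            log.Fatal(err)
        }
        // Bin by the power of two the uncompressed size falls under.
        k := bits.Len64(uint64(u))
        if bins[k] == nil {
            bins[k] = &bin{}
        }
        bins[k].count++
        bins[k].compressed += c
        bins[k].uncompressed += u
    }

    var keys []int
    for k := range bins {
        keys = append(keys, k)
    }
    sort.Ints(keys)
    for _, k := range keys {
        b := bins[k]
        fmt.Printf("< %9d bytes: %9d chunks, compression ratio %.2f\n",
            int64(1)<<k, b.count, float64(b.uncompressed)/float64(b.compressed))
    }
}

Run it as e.g. go run chunk-report.go cacnk-sizes.csv (the file names are again just assumptions).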

rickynils commented 2 years ago

Looking at the total sum of uncompressed vs compressed (raw bytes) chunk sizes, the compression ratio lands at 2.38. This is roughly in line with the ZFS zstd compression ratio of 2.69 on the same data set. I don't know whether the zstd settings used by default in nix-casync and ZFS differ.

rickynils commented 2 years ago

The chunks in the csv file were produced by nix-casync revision 25eb0e59e23fd580187cab3c8e7860d9c0044e0c.

flokli commented 2 years ago

A friend of mine (not very active on GitHub) made a quick analysis:

nix-casync chunk size analysis.pdf