NixOS / nixpkgs

Investigate deduplication to reduce storage and transfer #89380

Open nh2 opened 4 years ago

nh2 commented 4 years ago

A fundamental issue with the way nix works is that updating a package with many dependencies will result in a mass rebuild, with subsequent cache.nixos.org storage cost, and for the users, mass download of many GB of packages.

This makes NixOS's data storage and transport requirements for updates much higher than for "mutable" Linux distributions (e.g. Debian) that can just ship a fix for an individual package. For example, a security fix to openssl.so might take 1 MB download on Debian, and 10 GB download on my NixOS system.

Block-based deduplication is a technique to split data into chunks, and to store chunks that appear in multiple files only once. Often, rolling hashes are used for this purpose; this is also how data transfer is avoided in rsync.
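To make the idea concrete, here is a minimal Python sketch of chunking with a rolling hash plus a chunk store keyed by SHA-256. It is not how bup, Borg or rsync are implemented; the window size, mask, chunk limits and the names `split` and `deduplicate` are all assumptions chosen for illustration.

```python
import hashlib
import random

WINDOW = 48            # bytes covered by the rolling hash (illustrative value)
MASK = (1 << 11) - 1   # cut where the low 11 hash bits are zero: ~2 KiB chunks
MIN_CHUNK, MAX_CHUNK = 256, 16384
BASE, MOD = 257, (1 << 61) - 1


def split(data: bytes) -> list[bytes]:
    """Cut data at positions chosen by a polynomial hash of the last WINDOW bytes."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # shift in the newest byte
        size = i + 1 - start
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def deduplicate(blobs):
    """Store every distinct chunk once; return (chunk store, one recipe per blob)."""
    store, recipes = {}, []
    for blob in blobs:
        recipe = []
        for chunk in split(blob):
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes


# Two "builds" that differ by ten inserted bytes (synthetic data, not a real package).
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(200_000))
v2 = v1[:100_000] + b"ten bytes!" + v1[100_000:]
store, recipes = deduplicate([v1, v2])
stored = sum(len(chunk) for chunk in store.values())
print(f"raw: {len(v1) + len(v2)} bytes, stored: {stored} bytes")
```

Each blob can be rebuilt by concatenating the chunks named in its recipe, so the second build only costs the few chunks around the insertion.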

The ZFS file system has a deduplication feature, but in "Zfs dedup on /nix/store – Is it worth it?" it was stated that it is not very effective for the nix store.

However, there are other programs that do deduplication, such as bup, Borg, Attic, which seem to work pretty well in my first experiments (see next post).

This issue is to record measurements of effectiveness of deduplication for nix, and perhaps lead towards the implementation of deduplication to solve the fundamental issue.

nh2 commented 4 years ago

I've tested bup 0.30 and got great results, e.g. deduplicating 4 Chromium builds to the size of 1.

Input data: 4 large chromium builds of same and different versions:

$ du -sh /nix/store/1a0ai4xgq8b0z6k9qwxsg2y4372kd19g-chromium-unwrapped-80.0.3987.149 /nix/store/8dv9k2h2vwak92ynic387ysmv3i95a82-chromium-unwrapped-81.0.4044.138 /nix/store/8rbmv9cm66xq66k6j7ydbys2kx589h0q-chromium-unwrapped-81.0.4044.138 /nix/store/n05dzhs61x4k82dlzv2m7vz1pnlgfzda-chromium-unwrapped-83.0.4103.61
352M    /nix/store/1a0ai4xgq8b0z6k9qwxsg2y4372kd19g-chromium-unwrapped-80.0.3987.149
352M    /nix/store/8dv9k2h2vwak92ynic387ysmv3i95a82-chromium-unwrapped-81.0.4044.138
352M    /nix/store/8rbmv9cm66xq66k6j7ydbys2kx589h0q-chromium-unwrapped-81.0.4044.138
354M    /nix/store/n05dzhs61x4k82dlzv2m7vz1pnlgfzda-chromium-unwrapped-83.0.4103.61

Deduplication with bup index + bup save (which saves the individual files inside the packages; a bit more overhead for bup because it has more paths to handle):

$ nix-shell -p bup
$ BUP_DIR=$PWD/tmp/nix-store-bup-dedup-test bup init
$ BUP_DIR=$PWD/tmp/nix-store-bup-dedup-test bup index /nix/store/1a0ai4xgq8b0z6k9qwxsg2y4372kd19g-chromium-unwrapped-80.0.3987.149 /nix/store/8dv9k2h2vwak92ynic387ysmv3i95a82-chromium-unwrapped-81.0.4044.138 /nix/store/8rbmv9cm66xq66k6j7ydbys2kx589h0q-chromium-unwrapped-81.0.4044.138 /nix/store/n05dzhs61x4k82dlzv2m7vz1pnlgfzda-chromium-unwrapped-83.0.4103.61
$ BUP_DIR=$PWD/tmp/nix-store-bup-dedup-test bup save -n chromium /nix/store/1a0ai4xgq8b0z6k9qwxsg2y4372kd19g-chromium-unwrapped-80.0.3987.149 /nix/store/8dv9k2h2vwak92ynic387ysmv3i95a82-chromium-unwrapped-81.0.4044.138 /nix/store/8rbmv9cm66xq66k6j7ydbys2kx589h0q-chromium-unwrapped-81.0.4044.138 /nix/store/n05dzhs61x4k82dlzv2m7vz1pnlgfzda-chromium-unwrapped-83.0.4103.61
$ du -sh $PWD/tmp/nix-store-bup-dedup-test
352M    /home/niklas/tmp/nix-store-bup-dedup-test

Deduplication with tar ... | bup split (which saves whole packages as tars):

$ nix-shell -p bup
$ BUP_DIR=$PWD/tmp/nix-store-bup-dedup-test-tar bup init
$ BUP_DIR=$PWD/tmp/nix-store-bup-dedup-test-tar sh -c 'tar c /nix/store/1a0ai4xgq8b0z6k9qwxsg2y4372kd19g-chromium-unwrapped-80.0.3987.149 /nix/store/8dv9k2h2vwak92ynic387ysmv3i95a82-chromium-unwrapped-81.0.4044.138 /nix/store/8rbmv9cm66xq66k6j7ydbys2kx589h0q-chromium-unwrapped-81.0.4044.138 /nix/store/n05dzhs61x4k82dlzv2m7vz1pnlgfzda-chromium-unwrapped-83.0.4103.61 | bup split -n chromium'
$ du -sh $PWD/tmp/nix-store-bup-dedup-test-tar
356M    /home/niklas/tmp/nix-store-bup-dedup-test-tar

This first test suggests that deduplication could be very effective.

Mic92 commented 4 years ago

There is also work going on to support ipfs with nix to make downloads more efficient: https://discourse.nixos.org/t/obsidian-systems-is-excited-to-bring-ipfs-support-to-nix/7375

nh2 commented 4 years ago

I've done some more benchmarking, deduplicating my laptop's current nix store into bup using this script nix-store-bup-benchmark.py:

Total disk usage:  62.4 GiB  Apparent size:  58.7 GiB

$ command time python3 nix-store-bup.py
...
920.22user 205.48system 26:37.07elapsed 70%CPU (0avgtext+0avgdata 129068maxresident)k
238341556inputs+34980541outputs (528498major+21083523minor)pagefaults 0swaps

$ du -sh ~/tmp/nix-store-bup-dedup-test-tar
14G /home/niklas/tmp/nix-store-bup-dedup-test-tar

So it does look like with this approach I can store many NixOS generations at around the size of one.

I also like that the max memory usage was 128 MB.

wamserma commented 4 years ago

> Deduplication with tar ... | bup split (which saves whole packages as tars):

Deduplication of the NAR serialisation would be more reasonable, as it is the native format for Nix and might even be better dedupable. For more efficient distribution something like zsync should be investigated, preferably with added support for xz and zstd. Then nix could find a store path by stripping off the hash and matching the package name, compute the diffs based on the zsync-info and download only what is required.

The other issue I have with the tools you mention: They are designed for deduplication during backup. This is much different from deduping the local nix store, where you want to have the store paths accessible. For distribution these tools would need a lot of coordination between client and server.

nh2 commented 4 years ago

@wamserma

> Deduplication of the NAR serialisation would be more reasonable, as it is the native format for Nix and might even be better dedupable.

I used tar as an approximation of NAR, just because I don't know how to make a .nar locally -- I don't expect there to be any difference in deduplication efficiency between .tar and .nar.

Why might NAR be better dedupable?

> zsync

Yes, the main difference between zsync and the git-style content-addressable deduplication tools is that with zsync, one needs to identify a "matching" local file to compare against. Bup and similar tools do not need that (they can fetch just the missing blocks). In turn zsync requires a simpler protocol (essentially HTTP Range requests with keepalive) instead of a git clone-like protocol.

So the zsync approach might require less server load (though I am not sure of that because you can do git clone also over normal HTTPS; I'm not sure how expensive that is in comparison).
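A rough Python sketch of the content-addressed variant: the client looks at a manifest of chunk digests and requests only the chunks it does not already hold, with no need to pick a "matching" local file first. `ChunkServer`, `publish` and `substitute` are hypothetical names, the server is an in-memory stand-in for a binary cache, and the chunk boundaries are given by hand rather than computed.

```python
import hashlib


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


class ChunkServer:
    """Hypothetical in-memory stand-in for a cache serving chunks by digest."""

    def __init__(self):
        self.chunks = {}     # digest -> chunk bytes
        self.manifests = {}  # store path name -> ordered list of digests

    def publish(self, name, chunks):
        self.manifests[name] = [digest(c) for c in chunks]
        for c in chunks:
            self.chunks[digest(c)] = c


def substitute(server, name, local_chunks):
    """Rebuild `name`, fetching only chunks absent from local_chunks.

    Returns (data, bytes_downloaded); local_chunks is updated in place.
    """
    downloaded, parts = 0, []
    for d in server.manifests[name]:
        if d not in local_chunks:
            local_chunks[d] = server.chunks[d]  # stands in for one HTTP GET
            downloaded += len(local_chunks[d])
        parts.append(local_chunks[d])
    return b"".join(parts), downloaded


server = ChunkServer()
libc, app = b"libc" * 1000, b"app" * 1000
server.publish("hash1-xyz", [libc, b"openssl-old" * 100, app])
server.publish("hash2-xyz", [libc, b"openssl-new" * 100, app])

local = {}
_, first = substitute(server, "hash1-xyz", local)
data, second = substitute(server, "hash2-xyz", local)
print(f"first fetch: {first} bytes, update: {second} bytes")
```

The update only transfers the one changed chunk, because the other two digests are already in the client's local index, regardless of which store path they originally came from.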

> The other issue I have with the tools you mention: They are designed for deduplication during backup. This is much different from deduping the local nix store, where you want to have the store paths accessible. For distribution these tools would need a lot of coordination between client and server.

I don't quite understand this point. Why is "designed for deduplication during backup" different from deduping the local nix store?

wamserma commented 4 years ago

> I used tar as an approximation of NAR, just because I don't know how to make a .nar locally -- I don't expect there to be any difference in deduplication efficiency between .tar and .nar.

nix-store --export /nix/store/fbd9sr5lx1r5r9w3cl4d5p5hgwbhk9jj-hello-2.10 > hello.nar

> Why might NAR be better dedupable?

iirc NAR does not store time stamps, so diffs will be smaller and it has a defined ordering of files while tar stores files in the order they are passed.
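A small Python sketch of why this matters for deduplication. It uses plain tar as a stand-in (it does not produce NAR): two archives of identical file contents differ when timestamps and member order differ, and become byte-identical once both are normalized. The `archive` helper and the sample entries are made up for illustration.

```python
import io
import tarfile


def archive(entries, normalize=False):
    """Build an uncompressed tar from (name, mtime, data) tuples.

    With normalize=True, members are sorted by name and mtimes are zeroed,
    which mimics the two properties attributed to NAR above.
    """
    if normalize:
        entries = sorted((name, 0, data) for name, _mtime, data in entries)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, mtime, data in entries:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Same contents, built ten minutes apart and listed in a different order.
build1 = [("bin/hello", 1590000000, b"\x7fELF..."), ("share/README", 1590000000, b"hi\n")]
build2 = [("share/README", 1590000600, b"hi\n"), ("bin/hello", 1590000600, b"\x7fELF...")]

print("plain tars identical:     ", archive(build1) == archive(build2))
print("normalized tars identical:", archive(build1, True) == archive(build2, True))
```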

> Yes, the main difference between zsync and the git-style content-addressable deduplication tools is that with zsync, one needs to identify a "matching" local file to compare against. Bup and similar tools do not need that (they can fetch just the missing blocks). In turn zsync requires a simpler protocol (essentially HTTP Range requests with keepalive) instead of a git clone-like protocol.

You still need a way to identify the locally available blocks and the missing blocks. Borgbackup does this by keeping track of all the blocks in the archive, and every time it would write a new block to the archive it checks whether this block is already there. When using this method for distribution, you get something like BitTorrent or IPFS. See also https://github.com/NixOS/nix/issues/3260 for a similar discussion.

> I don't quite understand this point. Why is "designed for deduplication during backup" different from deduping the local nix store?

For backups you make a full backup once and then only store the diffs. Your goal is efficiency in storage and transfer. You accept some latency when retrieving files from the backup, because you don't do this often and you work with the current copy of the file on your disk. For the nix store, you might have different realizations of a single derivation and it is not trivial for the system to decide which one should be kept and which one should be deduplicated. But resolving block-based deduplication that sits on top of the file layer is slow, as it has to reconstruct the file on the fly, so this is a performance-critical decision. File-based deduplication, which does not carry this penalty, is already available by hard-linking identical files in the store (nix-store --optimise).
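For comparison, file-level deduplication by hard-linking can be sketched in a few lines of Python. This only illustrates the idea and is not how nix-store --optimise is implemented; the function name, the choice of SHA-256 and the temporary-file suffix are assumptions.

```python
import hashlib
import os
import tempfile


def hardlink_duplicates(root: str) -> int:
    """Hard-link byte-identical regular files under root; return the bytes saved."""
    first_seen = {}  # (is_executable, sha256) -> path of the copy we keep
    saved = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            with open(path, "rb") as f:
                key = (bool(st.st_mode & 0o111), hashlib.sha256(f.read()).hexdigest())
            keep = first_seen.setdefault(key, path)
            if keep == path or os.path.samefile(keep, path):
                continue  # first copy, or already linked to it
            tmp = path + ".dedup-tmp"
            os.link(keep, tmp)     # second name for the kept inode
            os.replace(tmp, path)  # atomically swap it in for the duplicate
            saved += st.st_size
    return saved


# Tiny fake store: two packages ship the same file, a third one differs.
with tempfile.TemporaryDirectory() as fake_store:
    for pkg, content in [("aaa-openssl-1.1.1f", b"libssl" * 1000),
                         ("bbb-openssl-1.1.1g", b"libssl" * 1000),
                         ("ccc-hello-2.10", b"hello")]:
        os.makedirs(os.path.join(fake_store, pkg))
        with open(os.path.join(fake_store, pkg, "data"), "wb") as f:
            f.write(content)
    print("bytes saved:", hardlink_duplicates(fake_store))
```

Reads stay as fast as before because each path is still an ordinary file, but a single changed byte makes the whole file unique again, which is the gap block-based schemes try to close.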

btw: this discussion rather belongs to https://github.com/NixOS/nix/ than nixpkgs.

nh2 commented 4 years ago

Another measurement:

I have now deduped the 900 GB /nix/store of my static-haskell-nix-ci build server.

bup stores it in 95 GB, thus making a 10x reduction for that use case.

[root@hetzner:~]# BUP_DIR=bup-nix-store-test command time bup index --update --one-file-system /nix/store/ && BUP_DIR=bup-nix-store-test command time bup save --name=nix-store /nix/store/
Indexing: 29021189, done (629 paths/s).
4947.53user 1121.11system 12:48:19elapsed 13%CPU (0avgtext+0avgdata 9468616maxresident)k
146987816inputs+8297848outputs (46major+2549093minor)pagefaults 0swaps
Reading index: 29021189, done.
bloom: creating from 1 file (200000 objects).
bloom: adding 1 file (200000 objects).
Saving: 1.73% (15544884/900216921k, 525605/29021189 files) 39h14m 6000k/s
...
Saving: 100.00% (900216921/900216921k, 29021189/29021189 files), done.
bloom: adding 1 file (53509 objects).
30147.51user 3439.65system 44:44:57elapsed 20%CPU (0avgtext+0avgdata 12469576maxresident)k
2106132576inputs+252400608outputs (454817major+27708318minor)pagefaults 0swaps

It took 44 hours (thus around 6 MB/s, as shown). Note this is a backup of all the individual files inside all store paths, not per-package archives (so, not .nar or .tar).
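The two headline numbers can be re-derived from the log above. A short Python check, assuming that bup's "k" means KiB and taking the sizes 900 GB and 95 GB at face value:

```python
# Figures copied from the bup output and the comment above.
total_k = 900216921                      # "Saving: 100.00% (900216921/900216921k ..."
elapsed_s = 44 * 3600 + 44 * 60 + 57     # "44:44:57elapsed"

# Assumption: "k" is KiB (1024 bytes); with 1000-byte units the result is ~2% lower.
throughput_mb_s = total_k * 1024 / elapsed_s / 1e6

store_gb, bup_repo_gb = 900, 95
reduction = store_gb / bup_repo_gb

print(f"throughput: {throughput_mb_s:.1f} MB/s, reduction: {reduction:.1f}x")
```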

wamserma commented 4 years ago

> I have now deduped the 900 GB /nix/store of my static-haskell-nix-ci build server.

Not exactly. You made a space-efficient backup. To really dedupe, you have to delete the 900GB of data in /nix/store/ and mount your copy via sudo bup fuse -o /nix/store. This might actually work as long as you do not write to the store (or put another filesystem overlay to catch the writes).

nh2 commented 4 years ago

> The ZFS file system has a deduplication feature, but in "Zfs dedup on /nix/store – Is it worth it?" it was stated that it is not very effective for the nix store.

For my memory, I now know why it doesn't work well with ZFS: It uses static chunking, which works well if you make a copy of a file and flip a few bits on it, but not if you make shift-style changes (e.g. inserting some bytes in the middle that shift all subsequent bytes to the right), and those are exactly the changes that occur if some functions are added/removed from .o files or binaries. Content-defined chunking schemes like bup can handle shifts. See https://serverfault.com/questions/302584/how-does-zfs-block-level-deduplication-fit-with-variable-block-size/317651#317651

https://en.wikipedia.org/wiki/Rabin_fingerprint also has a reference to Muthitacharoen, Chen, Mazières: "A Low-bandwidth Network File System" (LBFS, from 2001), which implements this deduplication approach as an alternative to caching NFS mounts (but the implementation is abandoned now).
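The difference between static and content-defined chunking under a shift-style change can be demonstrated with a short Python experiment. This is an illustrative sketch only: it uses CRC32 over a 16-byte window as the boundary test (neither ZFS nor bup works exactly like this), and the block size, mask and test data are arbitrary assumptions.

```python
import hashlib
import random
import zlib


def fixed_chunks(data, size=1024):
    """Static chunking: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data, window=16, mask=0x3FF):
    """Content-defined chunking: cut wherever the checksum of the previous
    `window` bytes has its low bits zero (about every 1 KiB on average)."""
    cuts = [i for i in range(window, len(data))
            if zlib.crc32(data[i - window:i]) & mask == 0]
    edges = [0] + cuts + [len(data)]
    return [data[a:b] for a, b in zip(edges, edges[1:])]


def reused_fraction(old_chunks, new_chunks):
    """Fraction of the new data's bytes that sit in chunks already stored."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    reused = sum(len(c) for c in new_chunks if hashlib.sha256(c).digest() in known)
    return reused / sum(len(c) for c in new_chunks)


rng = random.Random(42)
old = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
new = old[:1000] + b"INSERTED" + old[1000:]  # insertion shifts all later bytes

fixed_reuse = reused_fraction(fixed_chunks(old), fixed_chunks(new))
cdc_reuse = reused_fraction(cdc_chunks(old), cdc_chunks(new))
print(f"fixed-size blocks reused: {fixed_reuse:.0%}, content-defined: {cdc_reuse:.0%}")
```

With fixed blocks every block after the insertion has different contents, while the content-defined boundaries move along with the data, so only the chunk around the insertion is new.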

stale[bot] commented 3 years ago

I marked this as stale due to inactivity.

nh2 commented 3 years ago

The next step here I've planned is to make a proof of concept for a bup-caching nix substituter via FUSE.

cpitclaudel commented 3 years ago

@nh2 For reference, https://github.com/opendedup/sdfs claims to be a file system with deduplication based on content-dependent chunking.

RonnyPfannschmidt commented 3 years ago

perhaps it would be beneficial to introduce a number of compounding utilities

a) a structure that enables making locally relative references for globally unique items (so the actual hash of a dependency is no longer represented in a binary directly)
b) a git-pack-like replacement for nar
c) thin delta packs that allow pulling the differences between 2 "git pack nars" efficiently

that way all software that pulls updates would be able to ask a server for the contents of an archive but subtracting the content of a prior known archive

and known upgrade paths could be cached based on cost

RonnyPfannschmidt commented 3 years ago

also, if git-style object names are used, a cache of existing files could potentially be used when unpacking pack-style archives to limit file-system operations (but adding the need for a new type of gc)

L-as commented 3 years ago

I honestly think this issue is a non-problem, since it will be solved when IPFS is supported natively AFAICT.

Mic92 commented 3 years ago

Is ipfs support still actively worked on? It looks to me as if content-addressed derivations are the way to go now, although they still seem a bit unstable.

L-as commented 3 years ago

How would IPFS support work without a content-addressable store?

L-as commented 3 years ago

in https://discourse.nixos.org/t/obsidian-systems-is-excited-to-bring-ipfs-support-to-nix/7375 it says RFC 62 is required for it to work, so I assume this is the main blocker. I suppose @Ericson2314 can answer more accurately about the status of IPFS support.

nixos-discourse commented 2 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/github-elfshaker-elfshaker-elfshaker-stores-binary-objects-efficiently/16159/3

nh2 commented 2 years ago

Small update with some collected info:

New related projects:

These projects so far only help with reducing transfers, not with making the nix store stored on your disk smaller.

For that, there is another interesting prospect:

flokli commented 2 years ago

@nh2 nix-casync was meant as an easy POC to see if there's potential space gains by using chunking and dedup. These could be leveraged both in-transit as well as when storing on disk - for example if that protocol gets integrated into Nix (nix could use reflink copies to realize all chunks that are part of a store path). Another idea was to provide /nix/store as a fuse filesystem - and the necessary plumbing to still allow booting /could/ be provided in-initrd.

I kinda got busy with various other things, so couldn't really continue working on that work, but I want to get back to it. I see you already joined the Matrix channel, I'll make sure to post updates there :-)

Note nix-casync currently uses desync under the hood to do the chunking - however the potential performance improvements don't really apply to our usecase, introduce a lot of complexity, and we might end up with something much simpler.

wamserma commented 2 years ago

I think sth. exploiting the structure of NAR files for intelligent chunking would be beneficial for our use case. Maybe the ideas behind zsync could be adapted to zstd, then a client could create a NAR of store/hash1-xyz on the fly and zsync to obtain the necessary chunks for the compressed NAR of store/hash2-xyz. Aside from the compression (zstd vs. xz) this would be fully backward compatible. In many cases (especially mass-rebuilds) the diffs will be very small, mostly store paths, so a more customized protocol should probably first normalize NARs by extracting store paths to ease diffing.
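One possible reading of "normalize by extracting store paths", sketched in Python: blank out the 32-character hash of every embedded /nix/store reference, keep the hashes with their offsets on the side, and diff or chunk only the normalized bytes. The function names and the zero-filled placeholder are assumptions, and the two sample blobs are made up (they merely reuse store hashes that appear elsewhere in this thread).

```python
import re

# Nix's base32 alphabet omits the letters e, o, t and u.
NIX_BASE32 = b"0123456789abcdfghijklmnpqrsvwxyz"
STORE_REF = re.compile(rb"/nix/store/([" + NIX_BASE32 + rb"]{32})-")


def normalize(data: bytes):
    """Replace store-path hashes by zeros; return (normalized, [(offset, hash)])."""
    refs = []

    def blank(match):
        refs.append((match.start(1), match.group(1)))
        return b"/nix/store/" + b"0" * 32 + b"-"

    # The placeholder has the same length, so recorded offsets stay valid.
    return STORE_REF.sub(blank, data), refs


def restore(normalized: bytes, refs):
    """Inverse of normalize: write the recorded hashes back at their offsets."""
    out = bytearray(normalized)
    for offset, store_hash in refs:
        out[offset:offset + 32] = store_hash
    return bytes(out)


# Two hypothetical rebuilds whose bytes differ only in one embedded reference.
suffix = b"-chromium-unwrapped-113.0.5672.126/lib\x00rest of the binary"
build_a = b"\x7fELF rpath=/nix/store/1kpj76401m3fp9hjmjjy9pr7gcwfn354" + suffix
build_b = b"\x7fELF rpath=/nix/store/zwkx9a47kyrsi8i6wcq76aghzaj383db" + suffix

norm_a, refs_a = normalize(build_a)
norm_b, refs_b = normalize(build_b)
print("identical after normalization:", norm_a == norm_b)
```

After normalization a mass rebuild that only changed references produces identical payloads, and the per-build difference shrinks to the small list of (offset, hash) pairs.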

milahu commented 2 years ago

> In many cases (especially mass-rebuilds) the diffs will be very small

future goal: also optimize the non-trivial cases; see courgette → run diff/patch on decompiled binaries, to produce 10x smaller diffs

klarkc commented 1 year ago

I want to add my point of view here: I've been heavily using btrfs with bees with relative success for a year. This is the result of compsize on my nix folder:

[root@ssdinarch /]# compsize /nix/
Processed 3409979 files, 1107908 regular extents (1967786 refs), 1853709 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       83%       52G          63G          76G
none       100%       49G          49G          57G
zlib        22%      3.0G          13G          18G

In my scenario, bees is managed by systemd, so I can limit the resource usage and - because bees does not care about the filesystem (it works at the block level) - it's almost impossible to have an issue in an immutable structure like nix.

I would recommend integrating nix with btrfs instead of creating a custom application-layer solution. Maybe nix could have some intermediary layer to specify the dedup solution to be used on the host and, with those settings in place, optimize the data allocation process.

nixos-discourse commented 1 year ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/introducing-flox-nix-for-simplicity-and-scale/11275/26

blaggacao commented 1 year ago

> I would recommend to integrate nix with btrfs instead of creating a custom application-layer solution

The specifics of the use case have exploitable context (see some of the comments above).

This is what makes an app-layer solution based on widely used standards / libraries appealing.

nh2 commented 1 year ago

Some more measurements I did with bup and bupstash across Chromium versions in NixOS 22.11 and NixOS unstable.

Jump to Summary at the end of this post to skip over the approach details.

chromium in NixOS 22.11

Hydra for package chromium, x86_64-linux, https://hydra.nixos.org/job/nixos/release-22.11/nixpkgs.chromium.x86_64-linux/all

EVAL        FINISHED AT PACKAGE NAME              UNWRAPPED STORE PATH
221320506   2023-05-24  chromium-113.0.5672.126   /nix/store/1kpj76401m3fp9hjmjjy9pr7gcwfn354-chromium-unwrapped-113.0.5672.126
220907185   2023-05-23  chromium-113.0.5672.126   /nix/store/zwkx9a47kyrsi8i6wcq76aghzaj383db-chromium-unwrapped-113.0.5672.126
220090267   2023-05-15  chromium-113.0.5672.92    /nix/store/9vcxzj2s4crijsjjagqq5msrlxfjvhyz-chromium-unwrapped-113.0.5672.92
219663965   2023-05-13  chromium-113.0.5672.92    /nix/store/cxi1is71zvdzsz0czswi4xk7yjs0k7jv-chromium-unwrapped-113.0.5672.92
217001551   2023-04-22  chromium-112.0.5615.165   /nix/store/y2l3pcxx34fdx1fc0a2wiiz7w59pfp8a-chromium-unwrapped-112.0.5615.165
216435731   2023-04-18  chromium-112.0.5615.121   /nix/store/45wxpdpfcnm8snk10j1dcjzgr5yacdiy-chromium-unwrapped-112.0.5615.121
216336158   2023-04-16  chromium-112.0.5615.121   /nix/store/rwzqyzvywnidja9p1y2clgpwsj9jy45p-chromium-unwrapped-112.0.5615.121
215397899   2023-04-09  chromium-112.0.5615.49    /nix/store/b6nnz0jbcr6ssvfw4104bv0jri4pxs5g-chromium-unwrapped-112.0.5615.49
214664244   2023-04-03  chromium-111.0.5563.146   /nix/store/x47zjqfkskvdypwfsfii6fvlm1f2m95c-chromium-unwrapped-111.0.5563.146
214619570   2023-04-01  chromium-111.0.5563.146   /nix/store/dmn7valpzl824icqvm79cqqs9s0x97s2-chromium-unwrapped-111.0.5563.146
213593038   2023-03-24  chromium-111.0.5563.110   /nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110
212071067   2023-03-12  chromium-111.0.5563.64    /nix/store/6h94l4a3xlm9jxfpjq762kxchijjj1h0-chromium-unwrapped-111.0.5563.64
211810299   2023-03-09  chromium-111.0.5563.64    /nix/store/58d7rc14xvswi44cmjrabmqs4v12sq1g-chromium-unwrapped-111.0.5563.64
211097888   2023-03-01  chromium-110.0.5481.177   /nix/store/b892d7p3avmmzj1nvm7cx8yq7akphara-chromium-unwrapped-110.0.5481.177
210870862   2023-02-27  chromium-110.0.5481.177   /nix/store/nvgyqqgaba3rp4bfxm5z0i75ma3cajpi-chromium-unwrapped-110.0.5481.177
210527728   2023-02-26  chromium-110.0.5481.177   /nix/store/m1kf457zpvrq4pnl2ys5d3r5wwy5qg4l-chromium-unwrapped-110.0.5481.177
209904431   2023-02-18  chromium-110.0.5481.100   /nix/store/h06w44dvd25g348m4ms6bw1zirnsga5n-chromium-unwrapped-110.0.5481.100
208849960   2023-02-11  chromium-110.0.5481.77    /nix/store/75zb0sjwc8j3grfdmxflwr6l844ksls5-chromium-unwrapped-110.0.5481.77
208592085   2023-02-09  chromium-110.0.5481.77    /nix/store/aqn31m10a6m2p0jgp1acl73hd4rxhzsj-chromium-unwrapped-110.0.5481.77
207332131   2023-01-30  chromium-109.0.5414.119   /nix/store/hwyv8azq2lx3lhlr10r43gm50qv3dkl6-chromium-unwrapped-109.0.5414.119
206498037   2023-01-23  chromium-109.0.5414.74    /nix/store/5rxbmxbsj7pbp5l3fcnkf84a9gfxidc6-chromium-unwrapped-109.0.5414.74
205781964   2023-01-16  chromium-109.0.5414.74    /nix/store/2fh20imsydcz5sla4nkvajnbcp0qgxvy-chromium-unwrapped-109.0.5414.74
203518943   2022-12-31  chromium-108.0.5359.124   /nix/store/71spchrw7nrpqhkgmpx0vl4jsi8zrzii-chromium-unwrapped-108.0.5359.124
202409897   2022-12-18  chromium-108.0.5359.124   /nix/store/y6gd2vahn7nm7jwlsyl7j26p7a88djcv-chromium-unwrapped-108.0.5359.124
201572441   2022-12-10  chromium-108.0.5359.98    /nix/store/620lqprbzy4pgd2x4zkg7n19rfd59ap7-chromium-unwrapped-108.0.5359.98
201141066   2022-12-09  chromium-108.0.5359.98    /nix/store/nq2g91pahhdvyw99kb18s9dh3csqg9my-chromium-unwrapped-108.0.5359.98
200758732   2022-12-05  chromium-108.0.5359.94    /nix/store/b2zqw6dmhryxzrdpgwa1a7v7mm03np2y-chromium-unwrapped-108.0.5359.94
200433324   2022-12-02  chromium-108.0.5359.71    /nix/store/xw3wm8p39dgws9falgwyhis5y3gpgx9w-chromium-unwrapped-108.0.5359.71
200014230   2022-11-26  chromium-107.0.5304.121   /nix/store/ljd6dfjmf6xryiki5vvywvf8kipc1j95-chromium-unwrapped-107.0.5304.121
199646451   2022-11-22  chromium-107.0.5304.110   /nix/store/lxg2x13cc4729sjicwqyjlf81a4wg1bq-chromium-unwrapped-107.0.5304.110

Downloading data

UNWRAPPED STORE PATH obtained via:

for EVAL in 221320506 220907185 220090267 219663965 217001551 216435731 216336158 215397899 214664244 214619570 213593038 212071067 211810299 211097888 210870862 210527728 209904431 208849960 208592085 207332131 206498037 205781964 203518943 202409897 201572441 201141066 200758732 200433324 200014230 199646451 ; do curl --silent --show-error "https://hydra.nixos.org/build/${EVAL}" | grep -oP 'nix-env \-i \K/nix/store/[^ ]*' | xargs nix-store -r | xargs nix-store -q --references | grep '\-chromium-unwrapped-' ; done | tee chromium-nixos-22.11-store-paths.txt

Total un-deduplicated size

du -sh --total $(tac chromium-nixos-22.11-store-paths.txt) | tail -n1
15G total

They are ~500 MB per Chromium store path.

bupstash deduplication

rm -f test-bupstash.key
bupstash new-key -o test-bupstash.key
export BUPSTASH_KEY="$PWD"/test-bupstash.key
export BUPSTASH_REPOSITORY="$PWD"/bupstash-repo-chromium-nixos-22.11
rm -rf "$BUPSTASH_REPOSITORY"
bupstash init

du -sh "$BUPSTASH_REPOSITORY"  # outputs 36K

for STORE_PATH in $(tac chromium-nixos-22.11-store-paths.txt); do echo "$STORE_PATH" | grep -oP '/nix/store/.*-chromium-unwrapped-\K.*' | tr '\n' ' ' && bupstash put --quiet --compression none --one-file-system --no-send-log --no-stat-caching "$STORE_PATH" >/dev/null && du --summarize --bytes "$BUPSTASH_REPOSITORY" | awk '{print $1}'; done
107.0.5304.110 484019615
107.0.5304.121 787128379
108.0.5359.71 1270913618
108.0.5359.94 1566248128
108.0.5359.98 1863148860
108.0.5359.98 1924551697
108.0.5359.124 2332125916
108.0.5359.124 2623872398
109.0.5414.74 3114048419
109.0.5414.74 3380817578
109.0.5414.119 3795381874
110.0.5481.77 4295815300
110.0.5481.77 4306718994
110.0.5481.100 4645588323
110.0.5481.177 5063521768
110.0.5481.177 5074423974
110.0.5481.177 5129307511
111.0.5563.64 5631616647
111.0.5563.64 5674969765
111.0.5563.110 6057202716
111.0.5563.146 6375567627
111.0.5563.146 6426129810
112.0.5615.49 6924892667
112.0.5615.121 7241788176
112.0.5615.121 7298936795
112.0.5615.165 7711706129
113.0.5672.92 8221796608
113.0.5672.92 8280717079
113.0.5672.126 8720658522
113.0.5672.126 8789724019

Thus deduplication is approximately 1.7x across these versions.
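To see how much each added store path costs, the cumulative repository sizes can be differenced. A small Python sketch using the first eight rows of the bupstash output above (the helper name `increments` is made up):

```python
# First eight rows of the bupstash listing above: version, cumulative repo bytes.
LISTING = """\
107.0.5304.110 484019615
107.0.5304.121 787128379
108.0.5359.71 1270913618
108.0.5359.94 1566248128
108.0.5359.98 1863148860
108.0.5359.98 1924551697
108.0.5359.124 2332125916
108.0.5359.124 2623872398
"""


def increments(listing: str):
    """Turn cumulative sizes into (version, bytes added by this store path)."""
    result, previous = [], 0
    for line in listing.strip().splitlines():
        version, total = line.split()
        result.append((version, int(total) - previous))
        previous = int(total)
    return result


for version, added in increments(LISTING):
    print(f"{version:<16} +{added / 1e6:7.1f} MB")
```

In these rows a patch-level bump typically adds a few hundred MB, and rebuilds of an unchanged version range from tens of MB to almost as much as a version bump.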

bupstash after tar

This alternative approach with tar works worse:

```sh
for STORE_PATH in $(tac chromium-nixos-22.11-store-paths.txt); do echo "$STORE_PATH" | grep -oP '/nix/store/.*-chromium-unwrapped-\K.*' | tr '\n' ' ' && bupstash put --quiet --compression none --one-file-system --no-send-log --no-stat-caching "$(rm -f store_path.tar && tar cf store_path.tar "$STORE_PATH" 2>/dev/null && echo store_path.tar)" >/dev/null && du --summarize --bytes "$BUPSTASH_REPOSITORY" | awk '{print $1}'; done
```

```
107.0.5304.110 484274119
107.0.5304.121 860557763
108.0.5359.71 1351230639
108.0.5359.94 1735229241
108.0.5359.98 2123417809
108.0.5359.98 2276682082
108.0.5359.124 2726578016
108.0.5359.124 3107953806
109.0.5414.74 3602681678
109.0.5414.74 3957494883
109.0.5414.119 4372398817
110.0.5481.77 4873179596
110.0.5481.77 4978873125
110.0.5481.100 5386648173
110.0.5481.177 5814731726
110.0.5481.177 5923743104
110.0.5481.177 6076262963
111.0.5563.64 6578405952
111.0.5563.64 6724111574
111.0.5563.110 7148805173
111.0.5563.146 7532470892
111.0.5563.146 7678883747
112.0.5615.49 8180069455
112.0.5615.121 8583968614
112.0.5615.121 8736801517
112.0.5615.165 9159754102
113.0.5672.92 9671148452
113.0.5672.92 9818311449
113.0.5672.126 10265349354
113.0.5672.126 10411171777
```

bup deduplication

export BUP_DIR="$PWD"/bup-repo-chromium-nixos-22.11
rm -rf "$BUP_DIR"
bup init

du -sh "$BUP_DIR"  # outputs 116K

for STORE_PATH in $(tac chromium-nixos-22.11-store-paths.txt); do echo "$STORE_PATH" | grep -oP '/nix/store/.*-chromium-unwrapped-\K.*' | tr '\n' ' ' && tar c "$STORE_PATH" 2>/dev/null | bup split -n nix-store-test --compress=0 >/dev/null 2>&1 && du --summarize --bytes "$BUP_DIR" | awk '{print $1}'; done

Increase of storage in the repo as store paths are added:

chromium_version bytes
107.0.5304.110 441750078
107.0.5304.121 685270380
108.0.5359.71 1095690597
108.0.5359.94 1345403134
108.0.5359.98 1594381397
108.0.5359.98 1620867236
108.0.5359.124 1880086884
108.0.5359.124 2131768355
109.0.5414.74 2541344295
109.0.5414.74 2667260649
109.0.5414.119 2926639615
110.0.5481.77 3352275058
110.0.5481.77 3355630601
110.0.5481.100 3619163466
110.0.5481.177 3872886786
110.0.5481.177 3876287588
110.0.5481.177 3899019390
111.0.5563.64 4328554327
111.0.5563.64 4345452047
111.0.5563.110 4605146981
111.0.5563.146 4854094549
111.0.5563.146 4884327306
112.0.5615.49 5295108347
112.0.5615.121 5557346598
112.0.5615.121 5579700357
112.0.5615.165 5856387829
113.0.5672.92 6281218528
113.0.5672.92 6303201999
113.0.5672.126 6584262667
113.0.5672.126 6617168563

Thus deduplication is approximately 2.2x across these versions.

chromium in nixpkgs nixos-unstable

From https://hydra.nixos.org/job/nixos/trunk-combined/nixpkgs.chromium.x86_64-linux/all

Evals:

```
EVAL        FINISHED AT PACKAGE NAME
220925991   2023-05-22  chromium-113.0.5672.126
220896099   2023-05-21  chromium-113.0.5672.126
220590787   2023-05-19  chromium-113.0.5672.92
219725752   2023-05-14  chromium-113.0.5672.92
219645873   2023-05-12  chromium-113.0.5672.92
219455927   2023-05-11  chromium-113.0.5672.63
218897447   2023-05-07  chromium-113.0.5672.63
218661888   2023-05-05  chromium-113.0.5672.63
218627045   2023-05-04  chromium-112.0.5615.165
218386419   2023-05-02  chromium-112.0.5615.165
218313404   2023-05-02  chromium-112.0.5615.165
217852950   2023-04-29  chromium-112.0.5615.165
217790492   2023-04-28  chromium-112.0.5615.165
217257833   2023-04-26  chromium-112.0.5615.165
217239363   2023-04-25  chromium-112.0.5615.165
217072305   2023-04-22  chromium-112.0.5615.165
216812497   2023-04-20  chromium-112.0.5615.121
216137674   2023-04-16  chromium-112.0.5615.121
216106324   2023-04-15  chromium-112.0.5615.49
215797576   2023-04-12  chromium-112.0.5615.49
215431158   2023-04-10  chromium-112.0.5615.49
215106405   2023-04-06  chromium-112.0.5615.49
214581493   2023-04-01  chromium-111.0.5563.146
214545396   2023-03-31  chromium-111.0.5563.110
214124322   2023-03-27  chromium-111.0.5563.110
213577676   2023-03-24  chromium-111.0.5563.110
213111967   2023-03-18  chromium-111.0.5563.64
212544972   2023-03-15  chromium-111.0.5563.64
212035533   2023-03-12  chromium-111.0.5563.64
211902820   2023-03-10  chromium-111.0.5563.64
211822970   2023-03-09  chromium-111.0.5563.64
210936569   2023-02-28  chromium-110.0.5481.177
210638474   2023-02-26  chromium-110.0.5481.177
210521234   2023-02-25  chromium-110.0.5481.177
210476660   2023-02-24  chromium-110.0.5481.100
210244590   2023-02-22  chromium-110.0.5481.100
210078702   2023-02-20  chromium-110.0.5481.100
209904775   2023-02-18  chromium-110.0.5481.100
209096703   2023-02-14  chromium-110.0.5481.77
208754251   2023-02-11  chromium-110.0.5481.77
208647055   2023-02-10  chromium-110.0.5481.77
208212441   2023-02-06  chromium-109.0.5414.119
208071743   2023-02-04  chromium-109.0.5414.119
207903681   2023-02-02  chromium-109.0.5414.119
207629629   2023-02-01  chromium-109.0.5414.119
207354337   2023-01-30  chromium-109.0.5414.119
207306398   2023-01-29  chromium-109.0.5414.74
206741739   2023-01-23  chromium-109.0.5414.74
206670921   2023-01-22  chromium-109.0.5414.74
206113784   2023-01-20  chromium-109.0.5414.74
206057215   2023-01-18  chromium-109.0.5414.74
205413915   2023-01-15  chromium-109.0.5414.74
205160811   2023-01-12  chromium-109.0.5414.74
```

for EVAL in 220925991 220896099 220590787 219725752 219645873 219455927 218897447 218661888 218627045 218386419 218313404 217852950 217790492 217257833 217239363 217072305 216812497 216137674 216106324 215797576 215431158 215106405 214581493 214545396 214124322 213577676 213111967 212544972 212035533 211902820 211822970 210936569 210638474 210521234 210476660 210244590 210078702 209904775 209096703 208754251 208647055 208212441 208071743 207903681 207629629 207354337 207306398 206741739 206670921 206113784 206057215 205413915 205160811 ; do curl --silent --show-error "https://hydra.nixos.org/build/${EVAL}" | grep -oP 'nix-env \-i \K/nix/store/[^ ]*' | xargs nix-store -r | xargs nix-store -q --references | grep '\-chromium-unwrapped-' ; done | tee chromium-nixos-unstable-store-paths.txt

Total un-deduplicated size

du -sh --total $(tac chromium-nixos-unstable-store-paths.txt) | tail -n1
21G total

bupstash deduplication

rm -f test-bupstash.key
bupstash new-key -o test-bupstash.key
export BUPSTASH_KEY="$PWD"/test-bupstash.key
export BUPSTASH_REPOSITORY="$PWD"/bupstash-repo-chromium-nixos-unstable
rm -rf "$BUPSTASH_REPOSITORY"
bupstash init

du -sh "$BUPSTASH_REPOSITORY"  # outputs 36K

for STORE_PATH in $(tac chromium-nixos-unstable-store-paths.txt); do echo "$STORE_PATH" | grep -oP '/nix/store/.*-chromium-unwrapped-\K.*' | tr '\n' ' ' && bupstash put --quiet --compression none --one-file-system --no-send-log --no-stat-caching "$STORE_PATH" >/dev/null && du --summarize --bytes "$BUPSTASH_REPOSITORY" | awk '{print $1}'; done

Increase of storage in the repo as store paths are added:

chromium_version bytes
109.0.5414.74 498834355
109.0.5414.74 793889744
109.0.5414.74 810197194
109.0.5414.74 810197941
109.0.5414.74 826505459
109.0.5414.74 842816970
109.0.5414.74 842817717
109.0.5414.119 1242706533
109.0.5414.119 1551108167
109.0.5414.119 1573996556
109.0.5414.119 1596885005
109.0.5414.119 1596885754
110.0.5481.77 2095088959
110.0.5481.77 2114778698
110.0.5481.77 2407160262
110.0.5481.100 2733766671
110.0.5481.100 2733767420
110.0.5481.100 2733768169
110.0.5481.100 2753456965
110.0.5481.177 3174455607
110.0.5481.177 3237014950
110.0.5481.177 3256704758
111.0.5563.64 3727127368
111.0.5563.64 3727128115
111.0.5563.64 3740787785
111.0.5563.64 3795220524
111.0.5563.64 3808876041
111.0.5563.110 4174161924
111.0.5563.110 4231314019
111.0.5563.110 4244968313
111.0.5563.146 4551552172
112.0.5615.49 5057055692
112.0.5615.49 5117815988
112.0.5615.49 5460886535
112.0.5615.49 5522069119
112.0.5615.121 5845117339
112.0.5615.121 5909050260
112.0.5615.165 6328833223
112.0.5615.165 6328833972
112.0.5615.165 6394082677
112.0.5615.165 6394083426
112.0.5615.165 6394084175
112.0.5615.165 6463673523
112.0.5615.165 6522845837
112.0.5615.165 6583555994
113.0.5672.63 7066034457
113.0.5672.63 7129787462
113.0.5672.63 7182393244
113.0.5672.92 7490763847
113.0.5672.92 7552217590
113.0.5672.92 7848660925
113.0.5672.126 8253254494
113.0.5672.126 8311722272

Thus deduplication is approximately 2.5x across these versions.
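
As a rough cross-check of that factor, here is a minimal sketch that divides the input size by the repository size. The helper name `dedup_factor` is made up for illustration and is not part of bupstash or bup. Since `du -sh` prints rounded values in powers of 1024, exact byte counts from `du --bytes` are assumed to be the better input:

```shell
# Hypothetical helper: print TOTAL_BYTES / STORED_BYTES with two decimals.
dedup_factor() {
  awk -v total="$1" -v stored="$2" 'BEGIN { printf "%.2f\n", total / stored }'
}

# Exact sizes instead of the rounded "21G"; the paths match the commands above.
# total_bytes=$(du --summarize --bytes --total $(cat chromium-nixos-unstable-store-paths.txt) | tail -n1 | awk '{print $1}')
# repo_bytes=$(du --summarize --bytes "$BUPSTASH_REPOSITORY" | awk '{print $1}')
# dedup_factor "$total_bytes" "$repo_bytes"
```

Reading 21G as 21 × 10^9 bytes against the final 8311722272 bytes gives about 2.5; reading it as 21 GiB gives about 2.7, so the rounded figure is only good to roughly one decimal place.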

bup deduplication

export BUP_DIR="$PWD"/bup-repo-chromium-nixos-unstable
rm -rf "$BUP_DIR"
bup init

du -sh "$BUP_DIR"  # outputs 116K

for STORE_PATH in $(tac chromium-nixos-unstable-store-paths.txt); do echo "$STORE_PATH" | grep -oP '/nix/store/.*-chromium-unwrapped-\K.*' | tr '\n' ' ' && tar c "$STORE_PATH" 2>/dev/null | bup split -n nix-store-test --compress=0 >/dev/null 2>&1 && du --summarize --bytes "$BUP_DIR" | awk '{print $1}'; done

Increase of storage in the repo as store paths are added:

chromium_version bytes
109.0.5414.74 455943866
109.0.5414.74 582077940
109.0.5414.74 584775561
109.0.5414.74 584777304
109.0.5414.74 587507814
109.0.5414.74 590608665
109.0.5414.74 590610408
109.0.5414.119 850115816
109.0.5414.119 1097177582
109.0.5414.119 1101769540
109.0.5414.119 1104163176
109.0.5414.119 1104164919
110.0.5481.77 1529893790
110.0.5481.77 1535993595
110.0.5481.77 1662808451
110.0.5481.100 1919843876
110.0.5481.100 1919845619
110.0.5481.100 1922798743
110.0.5481.100 1924270640
110.0.5481.177 2181484333
110.0.5481.177 2204413085
110.0.5481.177 2211407783
111.0.5563.64 2636385616
111.0.5563.64 2636387359
111.0.5563.64 2639809138
111.0.5563.64 2663030610
111.0.5563.64 2663492618
111.0.5563.110 2925926894
111.0.5563.110 2948201672
111.0.5563.110 2956103696
111.0.5563.146 3205812357
112.0.5615.49 3625869143
112.0.5615.49 3647726425
112.0.5615.49 3929617788
112.0.5615.49 3946711297
112.0.5615.121 4210777747
112.0.5615.121 4232855984
112.0.5615.165 4508268031
112.0.5615.165 4503788801
112.0.5615.165 4526017732
112.0.5615.165 4526019475
112.0.5615.165 4532310719
112.0.5615.165 4548371054
112.0.5615.165 4569679935
112.0.5615.165 4591495585
113.0.5672.63 5029253278
113.0.5672.63 5046697365
113.0.5672.63 5069699810
113.0.5672.92 5331826842
113.0.5672.92 5363390981
113.0.5672.92 5605690143
113.0.5672.126 5886216971
113.0.5672.126 5909759957

Thus deduplication is approximately 3.5x across these versions.
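
To see how much each individual store path added, the cumulative tables can be turned into per-step increments. This is a minimal sketch; the function name `repo_deltas` is an assumption for illustration, and it expects the `chromium_version bytes` format shown above on stdin:

```shell
# Hypothetical helper: convert cumulative "version bytes" rows into per-row increments.
repo_deltas() {
  awk '
    $2 !~ /^[0-9]+$/ { next }   # skip the header row and any non-numeric lines
    {
      # prev starts at 0, so the first row reports the full initial size
      printf "%s %.0f\n", $1, $2 - prev
      prev = $2
    }
  '
}

# Usage, assuming the table was saved to a file of your choosing:
# repo_deltas < bup-sizes.txt
```

Increments of only a few KB (for example 1743 bytes in the fourth bup row) suggest that nearly all of that store path's content was already in the repository, while the largest increments line up with new Chromium versions.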

Summary

A reduction of 3.5x seems quite easily achievable across large store paths like Chromium.

The factor increases the more similar the builds are. Thus nixos-unstable has a better factor than nixos-22.11; for staging it will probably increase even further.

bupstash currently deduplicates less efficiently than bup, but in exchange it runs at about 500 MB/s.

nixos-discourse commented 1 year ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/the-nixos-foundations-call-to-action-s3-costs-require-community-support/28672/79

nh2 commented 1 year ago

I did the same test as in https://github.com/NixOS/nixpkgs/issues/89380#issuecomment-1575550831 with Attic:

chromium in NixOS 22.11

attic deduplication

I used attic 0.1.0 (release), and configured ~/.config/attic/server.toml to have compression type = "none" instead of the default zstd, so that it is comparable with the other benchmarks.

# Follow https://docs.attic.rs to set up a local attic server
attic cache create niklas-attic-test

for STORE_PATH in $(tac chromium-nixos-22.11-store-paths.txt); do echo "$STORE_PATH" | grep -oP '/nix/store/.*-chromium-unwrapped-\K.*' | tr '\n' ' ' && attic push niklas-attic-test --ignore-upstream-cache-filter --no-closure "$STORE_PATH" && du --summarize --bytes /root/.local/share/attic/storage | awk '{print $1}'; done

Output (manually split into my own size measurements first, followed by the statistics that Attic itself prints):

107.0.5304.110 441289856
107.0.5304.121 690218042
108.0.5359.71 1128498612
108.0.5359.94 1383074540
108.0.5359.98 1638851697
108.0.5359.98 1662627756
108.0.5359.124 1942823675
108.0.5359.124 2197970188
109.0.5414.74 2637505216
109.0.5414.74 2828837217
109.0.5414.119 3109001447
110.0.5481.77 3558650217
110.0.5481.77 3559329457
110.0.5481.100 3822488974
110.0.5481.177 4098408851
110.0.5481.177 4099088173
110.0.5481.177 4122128801
111.0.5563.64 4567938198
111.0.5563.64 4590126966
111.0.5563.110 4865767564
111.0.5563.146 5122600229
111.0.5563.146 5144889457
112.0.5615.49 5591444922
112.0.5615.121 5859168704
112.0.5615.121 5883245734
112.0.5615.165 6170293945
113.0.5672.92 6629312579
113.0.5672.92 6653340901
113.0.5672.126 6957538525
113.0.5672.126 6981382165

107.0.5304.110   (33.03 MiB/s, 8.8% deduplicated)
107.0.5304.121   (41.54 MiB/s, 48.6% deduplicated)
108.0.5359.71    (31.40 MiB/s, 11.4% deduplicated)
108.0.5359.94    (40.47 MiB/s, 48.5% deduplicated)
108.0.5359.98    (38.97 MiB/s, 48.3% deduplicated)
108.0.5359.98    (61.60 MiB/s, 95.2% deduplicated)
108.0.5359.124   (40.42 MiB/s, 43.4% deduplicated)
108.0.5359.124   (41.83 MiB/s, 48.4% deduplicated)
109.0.5414.74    (32.04 MiB/s, 12.0% deduplicated)
109.0.5414.74    (44.61 MiB/s, 61.7% deduplicated)
109.0.5414.119   (38.26 MiB/s, 43.9% deduplicated)
110.0.5481.77    (32.41 MiB/s, 10.1% deduplicated)
110.0.5481.77    (71.22 MiB/s, 99.9% deduplicated)
110.0.5481.100   (41.75 MiB/s, 47.4% deduplicated)
110.0.5481.177   (40.16 MiB/s, 44.9% deduplicated)
110.0.5481.177   (65.85 MiB/s, 99.9% deduplicated)
110.0.5481.177   (62.85 MiB/s, 95.4% deduplicated)
111.0.5563.64    (32.72 MiB/s, 11.6% deduplicated)
111.0.5563.64    (64.05 MiB/s, 95.6% deduplicated)
111.0.5563.110   (39.95 MiB/s, 45.3% deduplicated)
111.0.5563.146   (41.69 MiB/s, 49.1% deduplicated)
111.0.5563.146   (68.72 MiB/s, 95.6% deduplicated)
112.0.5615.49    (33.32 MiB/s, 12.3% deduplicated)
112.0.5615.121   (41.98 MiB/s, 47.4% deduplicated)
112.0.5615.121   (63.99 MiB/s, 95.3% deduplicated)
112.0.5615.165   (39.84 MiB/s, 43.6% deduplicated)
113.0.5672.92    (34.99 MiB/s, 12.1% deduplicated)
113.0.5672.92    (67.90 MiB/s, 95.4% deduplicated)
113.0.5672.126   (42.18 MiB/s, 41.7% deduplicated)
113.0.5672.126   (68.86 MiB/s, 95.4% deduplicated)

Counting files in the deduplication repository:

# find ~/.local/share/attic/storage -type f | wc -l
99845
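
The file count can be combined with the sizes to get a feel for the average size of a stored object. This is a minimal sketch, assuming GNU find (for `-printf`) and assuming that most files in the storage directory are deduplicated chunks; the function name `mean_file_size` is made up for illustration:

```shell
# Hypothetical helper: mean apparent size in bytes of regular files under DIR.
mean_file_size() {
  find "$1" -type f -printf '%s\n' |
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.0f\n", sum / n; else print 0 }'
}

# Usage against the storage directory used above:
# mean_file_size ~/.local/share/attic/storage
```

If the count was taken right after the last push, dividing the final 6981382165 bytes by 99845 files gives roughly 70 KB per file.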

Summarising:

Thus deduplication is approximately 2.14x across these versions, at ~40 MiB/s.
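
The throughput figure can be cross-checked by averaging the per-push rates that Attic prints. This is a minimal sketch that assumes the `(NN.NN MiB/s, NN.N% deduplicated)` line format shown above; the function name `attic_mean_rate` is made up for illustration. It computes an unweighted mean, so it does not account for pushes of different sizes:

```shell
# Hypothetical helper: unweighted mean of the "NN.NN MiB/s" values read from stdin.
attic_mean_rate() {
  sed -n 's/.*(\([0-9.]*\) MiB\/s.*/\1/p' |
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.2f\n", sum / n }'
}

# Usage, assuming Attic's lines were saved to a file of your choosing:
# attic_mean_rate < attic-output.txt
```

In the output above, pushes with around 10% deduplication run at roughly 32 MiB/s and pushes with around 95% deduplication at over 60 MiB/s, so a single average hides a fairly wide spread.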


Redoing the same benchmark without --no-closure, I get a similar reduction factor of 2.19x:

# du -sh --total $(nix-store -qR $(cat chromium-nixos-22.11-store-paths.txt)) | tail -n1
18G total

# du -sh ~/.local/share/attic/storage
8.2G    /root/.local/share/attic/storage

I found this bug while testing Attic this way: https://github.com/zhaofengli/attic/issues/61

nixos-discourse commented 1 year ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/introducing-attic-a-self-hostable-nix-binary-cache-server/24343/43