Open nh2 opened 4 years ago
I've tested bup 0.30 and got great results, e.g. deduplicating 4 Chromium builds to the size of 1.
Input data: 4 large chromium builds of same and different versions:
$ du -sh /nix/store/1a0ai4xgq8b0z6k9qwxsg2y4372kd19g-chromium-unwrapped-80.0.3987.149 /nix/store/8dv9k2h2vwak92ynic387ysmv3i95a82-chromium-unwrapped-81.0.4044.138 /nix/store/8rbmv9cm66xq66k6j7ydbys2kx589h0q-chromium-unwrapped-81.0.4044.138 /nix/store/n05dzhs61x4k82dlzv2m7vz1pnlgfzda-chromium-unwrapped-83.0.4103.61
352M /nix/store/1a0ai4xgq8b0z6k9qwxsg2y4372kd19g-chromium-unwrapped-80.0.3987.149
352M /nix/store/8dv9k2h2vwak92ynic387ysmv3i95a82-chromium-unwrapped-81.0.4044.138
352M /nix/store/8rbmv9cm66xq66k6j7ydbys2kx589h0q-chromium-unwrapped-81.0.4044.138
354M /nix/store/n05dzhs61x4k82dlzv2m7vz1pnlgfzda-chromium-unwrapped-83.0.4103.61
Deduplication with bup index + bup save (which saves the individual files inside the packages; a bit more overhead for bup because it has more paths to handle):
$ nix-shell -p bup
$ BUP_DIR=$PWD/tmp/nix-store-bup-dedup-test bup init
$ BUP_DIR=$PWD/tmp/nix-store-bup-dedup-test bup index /nix/store/1a0ai4xgq8b0z6k9qwxsg2y4372kd19g-chromium-unwrapped-80.0.3987.149 /nix/store/8dv9k2h2vwak92ynic387ysmv3i95a82-chromium-unwrapped-81.0.4044.138 /nix/store/8rbmv9cm66xq66k6j7ydbys2kx589h0q-chromium-unwrapped-81.0.4044.138 /nix/store/n05dzhs61x4k82dlzv2m7vz1pnlgfzda-chromium-unwrapped-83.0.4103.61
$ BUP_DIR=$PWD/tmp/nix-store-bup-dedup-test bup save -n chromium /nix/store/1a0ai4xgq8b0z6k9qwxsg2y4372kd19g-chromium-unwrapped-80.0.3987.149 /nix/store/8dv9k2h2vwak92ynic387ysmv3i95a82-chromium-unwrapped-81.0.4044.138 /nix/store/8rbmv9cm66xq66k6j7ydbys2kx589h0q-chromium-unwrapped-81.0.4044.138 /nix/store/n05dzhs61x4k82dlzv2m7vz1pnlgfzda-chromium-unwrapped-83.0.4103.61
$ du -sh $PWD/tmp/nix-store-bup-dedup-test
352M /home/niklas/tmp/nix-store-bup-dedup-test
Deduplication with tar ... | bup split (which saves whole packages as tars):
$ nix-shell -p bup
$ BUP_DIR=$PWD/tmp/nix-store-bup-dedup-test-tar bup init
$ BUP_DIR=$PWD/tmp/nix-store-bup-dedup-test-tar sh -c 'tar c /nix/store/1a0ai4xgq8b0z6k9qwxsg2y4372kd19g-chromium-unwrapped-80.0.3987.149 /nix/store/8dv9k2h2vwak92ynic387ysmv3i95a82-chromium-unwrapped-81.0.4044.138 /nix/store/8rbmv9cm66xq66k6j7ydbys2kx589h0q-chromium-unwrapped-81.0.4044.138 /nix/store/n05dzhs61x4k82dlzv2m7vz1pnlgfzda-chromium-unwrapped-83.0.4103.61 | bup split -n chromium'
$ du -sh $PWD/tmp/nix-store-bup-dedup-test-tar
356M /home/niklas/tmp/nix-store-bup-dedup-test-tar
This first test suggests that deduplication could be very effective.
There is also work going on to support ipfs with nix to make downloads more efficient: https://discourse.nixos.org/t/obsidian-systems-is-excited-to-bring-ipfs-support-to-nix/7375
I've done some more benchmarking, deduplicating my laptop's current nix store into bup using this script nix-store-bup-benchmark.py:
Total disk usage: 62.4 GiB Apparent size: 58.7 GiB
$ command time python3 nix-store-bup.py
...
920.22user 205.48system 26:37.07elapsed 70%CPU (0avgtext+0avgdata 129068maxresident)k
238341556inputs+34980541outputs (528498major+21083523minor)pagefaults 0swaps
$ du -sh ~/tmp/nix-store-bup-dedup-test-tar
14G /home/niklas/tmp/nix-store-bup-dedup-test-tar
So it indeed looks like with this approach I can store many NixOS generations at around the size of one.
I also like that the max memory usage was 128 MB.
Deduplication with tar ... | bup split (which saves whole packages as tars):
Deduplication of the NAR serialisation would be more reasonable, as it is the native format for Nix and might even be better dedupable. For more efficient distribution something like zsync should be investigated, preferably with added support for xz and zstd. Then nix could find a store path by stripping off the hash and matching the package name, compute the diffs based on the zsync-info and download only what is required.
The other issue I have with the tools you mention: They are designed for deduplication during backup. This is much different from deduping the local nix store, where you want to have the store paths accessible. For distribution these tools would need a lot of coordination between client and server.
@wamserma
Deduplication of the NAR serialisation would be more reasonable, as it is the native format for Nix and might even be better dedupable.
I used tar as an approximation of NAR, just because I don't know how to make a .nar locally -- I don't expect there to be any difference in deduplication efficiency between .tar and .nar.
Why might NAR be better dedupable?
zsync
Yes, the main difference between zsync and the git-style content-addressable deduplication tools is that with zsync, one needs to identify a "matching" local file to compare against. Bup and similar tools do not need that (they can fetch just the missing blocks). In turn zsync requires a simpler protocol (essentially HTTP Range requests with keepalive) instead of a git clone-like protocol.
So the zsync approach might require less server load (though I am not sure of that because you can also do git clone over normal HTTPS; I'm not sure how expensive that is in comparison).
The other issue I have with the tools you mention: They are designed for deduplication during backup. This is much different from deduping the local nix store, where you want to have the store paths accessible. For distribution these tools would need a lot of coordination between client and server.
I don't quite understand this point. Why is "designed for deduplication during backup" different from deduping the local nix store?
I used tar as an approximation of NAR, just because I don't know how to make a .nar locally -- I don't expect there to be any difference in deduplication efficiency between .tar and .nar.
nix-store --export /nix/store/fbd9sr5lx1r5r9w3cl4d5p5hgwbhk9jj-hello-2.10 > hello.nar
Why might NAR be better dedupable?
iirc NAR does not store time stamps, so diffs will be smaller and it has a defined ordering of files while tar stores files in the order they are passed.
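To illustrate the point about time stamps and ordering, here is a minimal Python sketch. It does not produce NAR; it only builds tar archives in memory with and without two NAR-like properties (sorted entries, no mtimes). The file names and contents are made up for the example.

```python
import io
import tarfile


def make_tar(files, normalise):
    """Build an uncompressed tar in memory from (name, data, mtime) tuples."""
    if normalise:
        # Mimic two NAR properties: a fixed entry order and no time stamps.
        files = sorted(files, key=lambda f: f[0])
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data, mtime in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = 0 if normalise else mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# The same two (hypothetical) files, archived in a different order and
# with different modification times.
build_a = [("bin/hello", b"binary", 1_600_000_000), ("share/doc", b"docs", 1_600_000_000)]
build_b = [("share/doc", b"docs", 1_700_000_000), ("bin/hello", b"binary", 1_700_000_000)]

print(make_tar(build_a, normalise=False) == make_tar(build_b, normalise=False))  # False
print(make_tar(build_a, normalise=True) == make_tar(build_b, normalise=True))    # True
```

With plain tar, identical content can serialise to different bytes, which a chunk-based deduplicator then has to store twice wherever the headers differ.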
Yes, the main difference between zsync and the git-style content-addressable deduplication tools is that with zsync, one needs to identify a "matching" local file to compare against. Bup and similar tools do not need that (they can fetch just the missing blocks). In turn zsync requires a simpler protocol (essentially HTTP Range requests with keepalive) instead of a git clone-like protocol.
You still need a way to identify the locally available blocks and the missing blocks. Borgbackup does this by keeping track of all the blocks in the archive, and every time it would write a new block to the archive it checks whether this block is already there. When using this method for distribution, you get something like BitTorrent or IPFS. See also https://github.com/NixOS/nix/issues/3260 for a similar discussion.
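As a rough sketch of what "identify the locally available blocks and the missing blocks" means on the client side (this is not how Borg, bup or zsync implement it; the function names and the tiny 4-byte chunk size are assumptions for illustration):

```python
import hashlib


def chunk_ids(data, size=4):
    """Split data into fixed-size chunks and return their SHA-256 hex digests."""
    return [hashlib.sha256(data[i:i + size]).hexdigest() for i in range(0, len(data), size)]


def plan_fetch(remote_index, local_ids):
    """Return the remote chunk IDs that are not available locally, in order, without repeats."""
    have = set(local_ids)
    missing = []
    for cid in remote_index:
        if cid not in have:
            missing.append(cid)
            have.add(cid)  # fetch each missing chunk only once
    return missing


local = chunk_ids(b"aaaabbbbcccc")
remote = chunk_ids(b"aaaaXXXXccccXXXX")
print(len(plan_fetch(remote, local)))  # 1: only the "XXXX" chunk has to be downloaded
```

The server only has to publish the ordered list of chunk IDs per archive; which chunks to request is decided entirely by the client.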
I don't quite understand this point. Why is "designed for deduplication during backup" different from deduping the local nix store?
For backups you make a backup once and then only store the diffs. Your goal is efficiency in storage and transfer. You accept some latency when retrieving files from the backup, because you don't do this often and you work with the current copy of the file on your disk.
For the nix store, you might have different realizations of a single derivation and it is not trivial for the system to decide which one should be kept and which one should be deduplicated. But resolving block-based deduplication that sits on top of the file-layer is slow as it has to reconstruct the file on-the-fly, so this is a performance-critical decision. File-based deduplication, which does not carry this penalty, is already available by hard-linking identical files in the store (nix-store --optimise).
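The idea behind file-based deduplication via hard links can be sketched as follows. This is only an illustration in Python, not what nix-store --optimise actually does internally; in particular it ignores permissions and ownership, and the directory and file names are made up.

```python
import hashlib
import os
import tempfile


def hardlink_duplicates(root):
    """Replace files with identical content under root by hard links to one copy."""
    first_seen = {}  # content hash -> path of the first file with that content
    saved = 0
    for dirpath, _dirs, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            original = first_seen.setdefault(digest, path)
            if original != path and not os.path.samefile(original, path):
                saved += os.path.getsize(path)
                os.unlink(path)
                os.link(original, path)  # both names now share one inode
    return saved


with tempfile.TemporaryDirectory() as store:
    for pkg in ("aaa-hello-2.10", "bbb-hello-2.10"):
        os.makedirs(os.path.join(store, pkg))
        with open(os.path.join(store, pkg, "COPYING"), "wb") as f:
            f.write(b"same licence text\n" * 100)
    saved_bytes = hardlink_duplicates(store)
    same_inode = os.path.samefile(
        os.path.join(store, "aaa-hello-2.10", "COPYING"),
        os.path.join(store, "bbb-hello-2.10", "COPYING"),
    )
    print(saved_bytes, same_inode)
```

Reads go straight to the shared inode, so there is no reconstruction cost, but two files that differ in a single byte share nothing.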
btw: this discussion rather belongs to https://github.com/NixOS/nix/ than nixpkgs.
Another measurement:
I have now deduped the 900 GB /nix/store of my static-haskell-nix-ci build server.
bup stores it in 95 GB, thus making a 10x reduction for that use case.
[root@hetzner:~]# BUP_DIR=bup-nix-store-test command time bup index --update --one-file-system /nix/store/ && BUP_DIR=bup-nix-store-test command time bup save --name=nix-store /nix/store/
Indexing: 29021189, done (629 paths/s).
4947.53user 1121.11system 12:48:19elapsed 13%CPU (0avgtext+0avgdata 9468616maxresident)k
146987816inputs+8297848outputs (46major+2549093minor)pagefaults 0swaps
Reading index: 29021189, done.
bloom: creating from 1 file (200000 objects).
bloom: adding 1 file (200000 objects).
Saving: 1.73% (15544884/900216921k, 525605/29021189 files) 39h14m 6000k/s
...
Saving: 100.00% (900216921/900216921k, 29021189/29021189 files), done.
bloom: adding 1 file (53509 objects).
30147.51user 3439.65system 44:44:57elapsed 20%CPU (0avgtext+0avgdata 12469576maxresident)k
2106132576inputs+252400608outputs (454817major+27708318minor)pagefaults 0swaps
It took 44 hours (thus around 6 MB/s, as shown). Note that this is a backup of all the individual files inside all store paths, not per-package archives (so, not .nar or .tar).
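The throughput and reduction figures can be re-derived from the log above. A small Python check, using only numbers quoted in this comment (the variable names are mine):

```python
def elapsed_seconds(hms):
    """Convert GNU time's h:mm:ss elapsed format to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


saved_kib = 900_216_921                   # from the final "Saving:" line (bup prints "k")
save_secs = elapsed_seconds("44:44:57")   # from the `time` output of `bup save`

throughput = saved_kib / save_secs        # k per second
reduction = 900 / 95                      # GB before / GB after, as stated above

print(f"{throughput:.0f} k/s, {reduction:.1f}x")  # 5588 k/s, 9.5x
```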
I have now deduped the 900 GB /nix/store of my static-haskell-nix-ci build server.
Not exactly. You made a space-efficient backup. To really dedupe, you have to delete the 900GB of data in /nix/store/ and mount your copy via sudo bup fuse -o /nix/store. This might actually work as long as you do not write to the store (or put another filesystem overlay to catch the writes).
The ZFS file system has a deduplication feature, but in Zfs dedup on /nix/store – Is it worth it? it was stated that it is not very effective for the nix store.
For my own reference, I now know why it doesn't work well with ZFS: It uses static chunking, which works well if you make a copy of a file and flip a few bits in it, but not if you make shift-style changes (e.g. inserting some bytes in the middle that shift all subsequent bytes to the right), and those are exactly the changes that occur if some functions are added/removed from .o files or binaries. Content-defined chunking schemes like the one in bup can handle shifts. See https://serverfault.com/questions/302584/how-does-zfs-block-level-deduplication-fit-with-variable-block-size/317651#317651
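The difference between static and content-defined chunking under a shift-style change can be shown with a small Python experiment. Note the assumptions: the rolling hash below is a simple gear-style hash chosen for brevity, not bup's actual rollsum, and the data is random bytes rather than real binaries.

```python
import hashlib
import random

rng = random.Random(0)
GEAR = [rng.getrandbits(32) for _ in range(256)]  # random per-byte values
MASK = 0xFFC00000  # top 10 bits: a boundary about every 1024 bytes on random data


def fixed_chunks(data, size=1024):
    """Static chunking: cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data):
    """Content-defined chunking: cut where a rolling hash of recent bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting left drops a byte's influence after 32 steps, so `h`
        # depends only on the last 32 bytes, not on the absolute offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def new_bytes(chunker, old, new):
    """Bytes of `new` whose chunks are not already stored for `old`."""
    stored = {hashlib.sha256(c).digest() for c in chunker(old)}
    return sum(len(c) for c in chunker(new) if hashlib.sha256(c).digest() not in stored)


old = bytes(rng.getrandbits(8) for _ in range(200_000))
new = old[:100_000] + b"new function" + old[100_000:]  # a shift-style change

fixed_cost = new_bytes(fixed_chunks, old, new)
cdc_cost = new_bytes(cdc_chunks, old, new)
print("fixed:", fixed_cost)
print("cdc:  ", cdc_cost)
```

With fixed-size chunks, every chunk after the insertion point is shifted and has to be stored again (roughly the second half of the file). With content-defined chunks, the boundaries re-synchronise shortly after the insertion, so only the chunks around it are new.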
In https://en.wikipedia.org/wiki/Rabin_fingerprint there is also a reference to Muthitacharoen, Chen, Mazières: "A Low-bandwidth Network File System" (LBFS, from 2001), which implements this deduplication approach as an alternative to caching NFS mounts (but the implementation is abandoned now).
The next step here I've planned is to make a proof of concept for a bup-caching nix substituter via FUSE.
@nh2 For reference, https://github.com/opendedup/sdfs claims to be a file system with deduplication based on content-dependent chunking.
perhaps it would be beneficial to introduce a number of compounding utilities:
a) a structure that enables making locally relative references for globally unique items (so the actual hash of a dependency is no longer represented in a binary directly)
b) a git-pack-like replacement for nar
c) thin delta packs that allow pulling the differences between 2 "git pack nars" effectively
that way all software that pulls updates would be able to ask a server for the contents of an archive, subtracting the content of a prior known archive
and known upgrade paths could be cached based on cost
also, if git-style object names are used, a cache of existing files could potentially be used when unpacking pack-style archives to limit file-system operations (but adding the need for a new type of gc)
I honestly think this issue is a non-problem, since it will be solved when IPFS is supported natively AFAICT.
Is ipfs support still actively worked on? It looks to me as if content-addressed derivations are the way to go now, although they still seem a bit unstable.
How would IPFS support work without a content-addressable store?
in https://discourse.nixos.org/t/obsidian-systems-is-excited-to-bring-ipfs-support-to-nix/7375 it says RFC 62 is required for it to work, so I assume this is the main blocker. I suppose @Ericson2314 can answer more accurately about the status of IPFS support.
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
Small update with some collected info:
New related projects:
nix-casync is another project going into this direction (@flokli):
desync, an alternative implementation of casync with higher performance:
These projects so far only help with reducing transfers, not with making the nix store stored on your disk smaller.
For that, there is another interesting prospect:
mmap are supported
reflinks
The copy_file_range() syscall can enable reflink support even through mapper file systems like FUSE or networked file systems, delegating copying and reflinking to the backend.
cp --reflink support
copy_file_range articles
@nh2 nix-casync was meant as an easy POC to see if there are potential space gains from chunking and dedup. These could be leveraged both in transit and when storing on disk, for example if that protocol gets integrated into Nix (nix could use reflink copies to realize all chunks that are part of a store path). Another idea was to provide /nix/store as a fuse filesystem; the necessary plumbing to still allow booting /could/ be provided in the initrd.
I kinda got busy with various other things, so couldn't really continue working on that work, but I want to get back to it. I see you already joined the Matrix channel, I'll make sure to post updates there :-)
Note nix-casync currently uses desync under the hood to do the chunking - however the potential performance improvements don't really apply to our usecase, introduce a lot of complexity, and we might end up with something much simpler.
I think something exploiting the structure of NAR files for intelligent chunking would be beneficial for our use case. Maybe the ideas behind zsync could be adapted to zstd; then a client could create a NAR of store/hash1-xyz on the fly and zsync to obtain the necessary chunks for the compressed NAR of store/hash2-xyz. Aside from the compression (zstd vs. xz) this would be fully backward compatible.
In many cases (especially mass-rebuilds) the diffs will be very small, mostly store paths, so a more customized protocol should probably first normalize NARs by extracting store paths to ease diffing.
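A rough Python sketch of such a normalisation step, under stated assumptions: the regex treats a store path hash as 32 lowercase alphanumeric characters, the function names are hypothetical, and the example blobs (package name `example-1.0`) are invented. It works on raw bytes rather than on a parsed NAR.

```python
import re

# Assumption: store path hashes are 32 lowercase alphanumeric characters.
STORE_REF = re.compile(rb"/nix/store/([0-9a-z]{32})-")
PLACEHOLDER = b"0" * 32


def normalise(blob):
    """Replace every store path hash by a same-length placeholder.

    Returns the normalised bytes and the hashes in order of appearance, which
    together are enough to reconstruct the original blob.
    """
    hashes = [m.group(1) for m in STORE_REF.finditer(blob)]
    return STORE_REF.sub(b"/nix/store/" + PLACEHOLDER + b"-", blob), hashes


def restore(normalised, hashes):
    """Inverse of normalise(): put the recorded hashes back in order."""
    remaining = iter(hashes)
    return STORE_REF.sub(lambda m: b"/nix/store/" + next(remaining) + b"-", normalised)


old = b"RPATH=/nix/store/8dv9k2h2vwak92ynic387ysmv3i95a82-example-1.0/lib\x00code"
new = b"RPATH=/nix/store/n05dzhs61x4k82dlzv2m7vz1pnlgfzda-example-1.0/lib\x00code"

norm_old, _ = normalise(old)
norm_new, new_hashes = normalise(new)
print(norm_old == norm_new)                  # True: only the hash differed
print(restore(norm_new, new_hashes) == new)  # True: the step is reversible
```

After normalisation, a rebuild that changed nothing but its dependencies' hashes produces identical bytes, so only the short list of hashes would need to be transferred.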
In many cases (especially mass-rebuilds) the diffs will be very small
future goal: also optimize the non-trivial cases see courgette → run diff/patch on decompiled binaries, to produce 10x smaller diffs
I want to add my point of view here, I've been heavily using btrfs with bees with relative success for a year. This is the result of compsize on my nix folder:
[root@ssdinarch /]# compsize /nix/
Processed 3409979 files, 1107908 regular extents (1967786 refs), 1853709 inline.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 83% 52G 63G 76G
none 100% 49G 49G 57G
zlib 22% 3.0G 13G 18G
In my scenario, bees is managed by systemd, so I can limit the resource usage and - because bees does not care about the filesystem (it works at the block level) - it's almost impossible to have an issue in an immutable structure like nix.
I would recommend to integrate nix with btrfs instead of creating a custom application-layer solution, maybe nix would have some intermediary layer to specify the dedup solution to be used in the host, and with that settings in place - optimize the data allocation process.
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/introducing-flox-nix-for-simplicity-and-scale/11275/26
I would recommend to integrate nix with btrfs instead of creating a custom application-layer solution
The specifics of the use case have exploitable context (see some of the comments above).
This is what makes an app-layer solution based on widely used standards / libraries appealing.
Some more measurements I did with bup and bupstash across Chromium versions in NixOS 22.11 and NixOS unstable.
Jump to Summary at the end of this post to skip over the approach details.
chromium in NixOS 22.11
Hydra for package chromium, x86_64-linux, https://hydra.nixos.org/job/nixos/release-22.11/nixpkgs.chromium.x86_64-linux/all
EVAL FINISHED AT PACKAGE NAME UNWRAPPED STORE PATH
221320506 2023-05-24 chromium-113.0.5672.126 /nix/store/1kpj76401m3fp9hjmjjy9pr7gcwfn354-chromium-unwrapped-113.0.5672.126
220907185 2023-05-23 chromium-113.0.5672.126 /nix/store/zwkx9a47kyrsi8i6wcq76aghzaj383db-chromium-unwrapped-113.0.5672.126
220090267 2023-05-15 chromium-113.0.5672.92 /nix/store/9vcxzj2s4crijsjjagqq5msrlxfjvhyz-chromium-unwrapped-113.0.5672.92
219663965 2023-05-13 chromium-113.0.5672.92 /nix/store/cxi1is71zvdzsz0czswi4xk7yjs0k7jv-chromium-unwrapped-113.0.5672.92
217001551 2023-04-22 chromium-112.0.5615.165 /nix/store/y2l3pcxx34fdx1fc0a2wiiz7w59pfp8a-chromium-unwrapped-112.0.5615.165
216435731 2023-04-18 chromium-112.0.5615.121 /nix/store/45wxpdpfcnm8snk10j1dcjzgr5yacdiy-chromium-unwrapped-112.0.5615.121
216336158 2023-04-16 chromium-112.0.5615.121 /nix/store/rwzqyzvywnidja9p1y2clgpwsj9jy45p-chromium-unwrapped-112.0.5615.121
215397899 2023-04-09 chromium-112.0.5615.49 /nix/store/b6nnz0jbcr6ssvfw4104bv0jri4pxs5g-chromium-unwrapped-112.0.5615.49
214664244 2023-04-03 chromium-111.0.5563.146 /nix/store/x47zjqfkskvdypwfsfii6fvlm1f2m95c-chromium-unwrapped-111.0.5563.146
214619570 2023-04-01 chromium-111.0.5563.146 /nix/store/dmn7valpzl824icqvm79cqqs9s0x97s2-chromium-unwrapped-111.0.5563.146
213593038 2023-03-24 chromium-111.0.5563.110 /nix/store/3gpnym3j1aiys0isd6vqvzhr7fg87bv5-chromium-unwrapped-111.0.5563.110
212071067 2023-03-12 chromium-111.0.5563.64 /nix/store/6h94l4a3xlm9jxfpjq762kxchijjj1h0-chromium-unwrapped-111.0.5563.64
211810299 2023-03-09 chromium-111.0.5563.64 /nix/store/58d7rc14xvswi44cmjrabmqs4v12sq1g-chromium-unwrapped-111.0.5563.64
211097888 2023-03-01 chromium-110.0.5481.177 /nix/store/b892d7p3avmmzj1nvm7cx8yq7akphara-chromium-unwrapped-110.0.5481.177
210870862 2023-02-27 chromium-110.0.5481.177 /nix/store/nvgyqqgaba3rp4bfxm5z0i75ma3cajpi-chromium-unwrapped-110.0.5481.177
210527728 2023-02-26 chromium-110.0.5481.177 /nix/store/m1kf457zpvrq4pnl2ys5d3r5wwy5qg4l-chromium-unwrapped-110.0.5481.177
209904431 2023-02-18 chromium-110.0.5481.100 /nix/store/h06w44dvd25g348m4ms6bw1zirnsga5n-chromium-unwrapped-110.0.5481.100
208849960 2023-02-11 chromium-110.0.5481.77 /nix/store/75zb0sjwc8j3grfdmxflwr6l844ksls5-chromium-unwrapped-110.0.5481.77
208592085 2023-02-09 chromium-110.0.5481.77 /nix/store/aqn31m10a6m2p0jgp1acl73hd4rxhzsj-chromium-unwrapped-110.0.5481.77
207332131 2023-01-30 chromium-109.0.5414.119 /nix/store/hwyv8azq2lx3lhlr10r43gm50qv3dkl6-chromium-unwrapped-109.0.5414.119
206498037 2023-01-23 chromium-109.0.5414.74 /nix/store/5rxbmxbsj7pbp5l3fcnkf84a9gfxidc6-chromium-unwrapped-109.0.5414.74
205781964 2023-01-16 chromium-109.0.5414.74 /nix/store/2fh20imsydcz5sla4nkvajnbcp0qgxvy-chromium-unwrapped-109.0.5414.74
203518943 2022-12-31 chromium-108.0.5359.124 /nix/store/71spchrw7nrpqhkgmpx0vl4jsi8zrzii-chromium-unwrapped-108.0.5359.124
202409897 2022-12-18 chromium-108.0.5359.124 /nix/store/y6gd2vahn7nm7jwlsyl7j26p7a88djcv-chromium-unwrapped-108.0.5359.124
201572441 2022-12-10 chromium-108.0.5359.98 /nix/store/620lqprbzy4pgd2x4zkg7n19rfd59ap7-chromium-unwrapped-108.0.5359.98
201141066 2022-12-09 chromium-108.0.5359.98 /nix/store/nq2g91pahhdvyw99kb18s9dh3csqg9my-chromium-unwrapped-108.0.5359.98
200758732 2022-12-05 chromium-108.0.5359.94 /nix/store/b2zqw6dmhryxzrdpgwa1a7v7mm03np2y-chromium-unwrapped-108.0.5359.94
200433324 2022-12-02 chromium-108.0.5359.71 /nix/store/xw3wm8p39dgws9falgwyhis5y3gpgx9w-chromium-unwrapped-108.0.5359.71
200014230 2022-11-26 chromium-107.0.5304.121 /nix/store/ljd6dfjmf6xryiki5vvywvf8kipc1j95-chromium-unwrapped-107.0.5304.121
199646451 2022-11-22 chromium-107.0.5304.110 /nix/store/lxg2x13cc4729sjicwqyjlf81a4wg1bq-chromium-unwrapped-107.0.5304.110
UNWRAPPED STORE PATH obtained via:
for EVAL in 221320506 220907185 220090267 219663965 217001551 216435731 216336158 215397899 214664244 214619570 213593038 212071067 211810299 211097888 210870862 210527728 209904431 208849960 208592085 207332131 206498037 205781964 203518943 202409897 201572441 201141066 200758732 200433324 200014230 199646451 ; do curl --silent --show-error "https://hydra.nixos.org/build/${EVAL}" | grep -oP 'nix-env \-i \K/nix/store/[^ ]*' | xargs nix-store -r | xargs nix-store -q --references | grep '\-chromium-unwrapped-' ; done | tee chromium-nixos-22.11-store-paths.txt
du -sh --total $(tac chromium-nixos-22.11-store-paths.txt) | tail -n1
15G total
They are ~500 MB per Chromium store path.
bupstash deduplication
rm -f test-bupstash.key
bupstash new-key -o test-bupstash.key
export BUPSTASH_KEY="$PWD"/test-bupstash.key
export BUPSTASH_REPOSITORY="$PWD"/bupstash-repo-chromium-nixos-22.11
rm -rf "$BUPSTASH_REPOSITORY"
bupstash init
du -sh "$BUPSTASH_REPOSITORY" # outputs 36K
for STORE_PATH in $(tac chromium-nixos-22.11-store-paths.txt); do echo "$STORE_PATH" | grep -oP '/nix/store/.*-chromium-unwrapped-\K.*' | tr '\n' ' ' && bupstash put --quiet --compression none --one-file-system --no-send-log --no-stat-caching "$STORE_PATH" >/dev/null && du --summarize --bytes "$BUPSTASH_REPOSITORY" | awk '{print $1}'; done
107.0.5304.110 484019615
107.0.5304.121 787128379
108.0.5359.71 1270913618
108.0.5359.94 1566248128
108.0.5359.98 1863148860
108.0.5359.98 1924551697
108.0.5359.124 2332125916
108.0.5359.124 2623872398
109.0.5414.74 3114048419
109.0.5414.74 3380817578
109.0.5414.119 3795381874
110.0.5481.77 4295815300
110.0.5481.77 4306718994
110.0.5481.100 4645588323
110.0.5481.177 5063521768
110.0.5481.177 5074423974
110.0.5481.177 5129307511
111.0.5563.64 5631616647
111.0.5563.64 5674969765
111.0.5563.110 6057202716
111.0.5563.146 6375567627
111.0.5563.146 6426129810
112.0.5615.49 6924892667
112.0.5615.121 7241788176
112.0.5615.121 7298936795
112.0.5615.165 7711706129
113.0.5672.92 8221796608
113.0.5672.92 8280717079
113.0.5672.126 8720658522
113.0.5672.126 8789724019
Thus deduplication is approximately 1.7x across these versions.
bupstash after tar
This alternative approach with tar works worse:
bup deduplication
export BUP_DIR="$PWD"/bup-repo-chromium-nixos-22.11
rm -rf "$BUP_DIR"
bup init
du -sh "$BUP_DIR" # outputs 116K
for STORE_PATH in $(tac chromium-nixos-22.11-store-paths.txt); do echo "$STORE_PATH" | grep -oP '/nix/store/.*-chromium-unwrapped-\K.*' | tr '\n' ' ' && tar c "$STORE_PATH" 2>/dev/null | bup split -n nix-store-test --compress=0 >/dev/null 2>&1 && du --summarize --bytes "$BUP_DIR" | awk '{print $1}'; done
Increase of storage in the repo as store paths are added:
chromium_version bytes
107.0.5304.110 441750078
107.0.5304.121 685270380
108.0.5359.71 1095690597
108.0.5359.94 1345403134
108.0.5359.98 1594381397
108.0.5359.98 1620867236
108.0.5359.124 1880086884
108.0.5359.124 2131768355
109.0.5414.74 2541344295
109.0.5414.74 2667260649
109.0.5414.119 2926639615
110.0.5481.77 3352275058
110.0.5481.77 3355630601
110.0.5481.100 3619163466
110.0.5481.177 3872886786
110.0.5481.177 3876287588
110.0.5481.177 3899019390
111.0.5563.64 4328554327
111.0.5563.64 4345452047
111.0.5563.110 4605146981
111.0.5563.146 4854094549
111.0.5563.146 4884327306
112.0.5615.49 5295108347
112.0.5615.121 5557346598
112.0.5615.121 5579700357
112.0.5615.165 5856387829
113.0.5672.92 6281218528
113.0.5672.92 6303201999
113.0.5672.126 6584262667
113.0.5672.126 6617168563
Thus deduplication is approximately 2.2x across these versions.
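To see how much each store path actually added, the cumulative numbers can be turned into per-path growth. A short Python sketch using the first six rows of the table above (the helper name is mine):

```python
def growth(rows):
    """Turn cumulative repository sizes into the bytes added by each store path."""
    added, previous = [], 0
    for version, total in rows:
        added.append((version, total - previous))
        previous = total
    return added


# First six rows of the table above: (chromium_version, cumulative bytes).
rows = [
    ("107.0.5304.110", 441750078),
    ("107.0.5304.121", 685270380),
    ("108.0.5359.71", 1095690597),
    ("108.0.5359.94", 1345403134),
    ("108.0.5359.98", 1594381397),
    ("108.0.5359.98", 1620867236),
]

for version, added in growth(rows):
    print(f"{version:<16} +{added / 1e6:7.1f} MB")
```

In these rows, a rebuild of the same version (the second 108.0.5359.98) adds about 26 MB, a patch-level bump about 250 MB, and a new major version about 410 MB.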
chromium in nixpkgs nixos-unstable
From https://hydra.nixos.org/job/nixos/trunk-combined/nixpkgs.chromium.x86_64-linux/all
Evals:
for EVAL in 220925991 220896099 220590787 219725752 219645873 219455927 218897447 218661888 218627045 218386419 218313404 217852950 217790492 217257833 217239363 217072305 216812497 216137674 216106324 215797576 215431158 215106405 214581493 214545396 214124322 213577676 213111967 212544972 212035533 211902820 211822970 210936569 210638474 210521234 210476660 210244590 210078702 209904775 209096703 208754251 208647055 208212441 208071743 207903681 207629629 207354337 207306398 206741739 206670921 206113784 206057215 205413915 205160811 ; do curl --silent --show-error "https://hydra.nixos.org/build/${EVAL}" | grep -oP 'nix-env \-i \K/nix/store/[^ ]*' | xargs nix-store -r | xargs nix-store -q --references | grep '\-chromium-unwrapped-' ; done | tee chromium-nixos-unstable-store-paths.txt
du -sh --total $(tac chromium-nixos-unstable-store-paths.txt) | tail -n1
21G total
bupstash deduplication
rm -f test-bupstash.key
bupstash new-key -o test-bupstash.key
export BUPSTASH_KEY="$PWD"/test-bupstash.key
export BUPSTASH_REPOSITORY="$PWD"/bupstash-repo-chromium-nixos-unstable
rm -rf "$BUPSTASH_REPOSITORY"
bupstash init
du -sh "$BUPSTASH_REPOSITORY" # outputs 36K
for STORE_PATH in $(tac chromium-nixos-unstable-store-paths.txt); do echo "$STORE_PATH" | grep -oP '/nix/store/.*-chromium-unwrapped-\K.*' | tr '\n' ' ' && bupstash put --quiet --compression none --one-file-system --no-send-log --no-stat-caching "$STORE_PATH" >/dev/null && du --summarize --bytes "$BUPSTASH_REPOSITORY" | awk '{print $1}'; done
Increase of storage in the repo as store paths are added:
chromium_version bytes
109.0.5414.74 498834355
109.0.5414.74 793889744
109.0.5414.74 810197194
109.0.5414.74 810197941
109.0.5414.74 826505459
109.0.5414.74 842816970
109.0.5414.74 842817717
109.0.5414.119 1242706533
109.0.5414.119 1551108167
109.0.5414.119 1573996556
109.0.5414.119 1596885005
109.0.5414.119 1596885754
110.0.5481.77 2095088959
110.0.5481.77 2114778698
110.0.5481.77 2407160262
110.0.5481.100 2733766671
110.0.5481.100 2733767420
110.0.5481.100 2733768169
110.0.5481.100 2753456965
110.0.5481.177 3174455607
110.0.5481.177 3237014950
110.0.5481.177 3256704758
111.0.5563.64 3727127368
111.0.5563.64 3727128115
111.0.5563.64 3740787785
111.0.5563.64 3795220524
111.0.5563.64 3808876041
111.0.5563.110 4174161924
111.0.5563.110 4231314019
111.0.5563.110 4244968313
111.0.5563.146 4551552172
112.0.5615.49 5057055692
112.0.5615.49 5117815988
112.0.5615.49 5460886535
112.0.5615.49 5522069119
112.0.5615.121 5845117339
112.0.5615.121 5909050260
112.0.5615.165 6328833223
112.0.5615.165 6328833972
112.0.5615.165 6394082677
112.0.5615.165 6394083426
112.0.5615.165 6394084175
112.0.5615.165 6463673523
112.0.5615.165 6522845837
112.0.5615.165 6583555994
113.0.5672.63 7066034457
113.0.5672.63 7129787462
113.0.5672.63 7182393244
113.0.5672.92 7490763847
113.0.5672.92 7552217590
113.0.5672.92 7848660925
113.0.5672.126 8253254494
113.0.5672.126 8311722272
Thus deduplication is approximately 2.5x across these versions.
bup deduplication
export BUP_DIR="$PWD"/bup-repo-chromium-nixos-unstable
rm -rf "$BUP_DIR"
bup init
du -sh "$BUP_DIR" # outputs 116K
for STORE_PATH in $(tac chromium-nixos-unstable-store-paths.txt); do echo "$STORE_PATH" | grep -oP '/nix/store/.*-chromium-unwrapped-\K.*' | tr '\n' ' ' && tar c "$STORE_PATH" 2>/dev/null | bup split -n nix-store-test --compress=0 >/dev/null 2>&1 && du --summarize --bytes "$BUP_DIR" | awk '{print $1}'; done
Increase of storage in the repo as store paths are added:
chromium_version bytes
109.0.5414.74 455943866
109.0.5414.74 582077940
109.0.5414.74 584775561
109.0.5414.74 584777304
109.0.5414.74 587507814
109.0.5414.74 590608665
109.0.5414.74 590610408
109.0.5414.119 850115816
109.0.5414.119 1097177582
109.0.5414.119 1101769540
109.0.5414.119 1104163176
109.0.5414.119 1104164919
110.0.5481.77 1529893790
110.0.5481.77 1535993595
110.0.5481.77 1662808451
110.0.5481.100 1919843876
110.0.5481.100 1919845619
110.0.5481.100 1922798743
110.0.5481.100 1924270640
110.0.5481.177 2181484333
110.0.5481.177 2204413085
110.0.5481.177 2211407783
111.0.5563.64 2636385616
111.0.5563.64 2636387359
111.0.5563.64 2639809138
111.0.5563.64 2663030610
111.0.5563.64 2663492618
111.0.5563.110 2925926894
111.0.5563.110 2948201672
111.0.5563.110 2956103696
111.0.5563.146 3205812357
112.0.5615.49 3625869143
112.0.5615.49 3647726425
112.0.5615.49 3929617788
112.0.5615.49 3946711297
112.0.5615.121 4210777747
112.0.5615.121 4232855984
112.0.5615.165 4508268031
112.0.5615.165 4503788801
112.0.5615.165 4526017732
112.0.5615.165 4526019475
112.0.5615.165 4532310719
112.0.5615.165 4548371054
112.0.5615.165 4569679935
112.0.5615.165 4591495585
113.0.5672.63 5029253278
113.0.5672.63 5046697365
113.0.5672.63 5069699810
113.0.5672.92 5331826842
113.0.5672.92 5363390981
113.0.5672.92 5605690143
113.0.5672.126 5886216971
113.0.5672.126 5909759957
Thus deduplication is approximately 3.5x across these versions.
A reduction of 3.5x seems quite easily achievable across large store paths like Chromium.
The factor increases the more similar the builds are. Thus nixos-unstable has a better factor than nixos-22.11; for staging it will probably increase even further.
bupstash is currently deduplicating less efficiently than bup, but in turn runs at 500 MB/s.
I did the same test as in https://github.com/NixOS/nixpkgs/issues/89380#issuecomment-1575550831 with Attic:
chromium in NixOS 22.11
attic deduplication
I used attic 0.1.0 (release), and configured ~/.config/attic/server.toml to have compression type = "none" instead of the default zstd, so that it is comparable with the other benchmarks.
# Follow https://docs.attic.rs to set up a local attic server
attic cache create niklas-attic-test
for STORE_PATH in $(tac chromium-nixos-22.11-store-paths.txt); do echo "$STORE_PATH" | grep -oP '/nix/store/.*-chromium-unwrapped-\K.*' | tr '\n' ' ' && attic push niklas-attic-test --ignore-upstream-cache-filter --no-closure "$STORE_PATH" && du --summarize --bytes /root/.local/share/attic/storage | awk '{print $1}'; done
Output (manually partitioned into my counting output, and Attic's own information):
107.0.5304.110 441289856
107.0.5304.121 690218042
108.0.5359.71 1128498612
108.0.5359.94 1383074540
108.0.5359.98 1638851697
108.0.5359.98 1662627756
108.0.5359.124 1942823675
108.0.5359.124 2197970188
109.0.5414.74 2637505216
109.0.5414.74 2828837217
109.0.5414.119 3109001447
110.0.5481.77 3558650217
110.0.5481.77 3559329457
110.0.5481.100 3822488974
110.0.5481.177 4098408851
110.0.5481.177 4099088173
110.0.5481.177 4122128801
111.0.5563.64 4567938198
111.0.5563.64 4590126966
111.0.5563.110 4865767564
111.0.5563.146 5122600229
111.0.5563.146 5144889457
112.0.5615.49 5591444922
112.0.5615.121 5859168704
112.0.5615.121 5883245734
112.0.5615.165 6170293945
113.0.5672.92 6629312579
113.0.5672.92 6653340901
113.0.5672.126 6957538525
113.0.5672.126 6981382165
107.0.5304.110 (33.03 MiB/s, 8.8% deduplicated)
107.0.5304.121 (41.54 MiB/s, 48.6% deduplicated)
108.0.5359.71 (31.40 MiB/s, 11.4% deduplicated)
108.0.5359.94 (40.47 MiB/s, 48.5% deduplicated)
108.0.5359.98 (38.97 MiB/s, 48.3% deduplicated)
108.0.5359.98 (61.60 MiB/s, 95.2% deduplicated)
108.0.5359.124 (40.42 MiB/s, 43.4% deduplicated)
108.0.5359.124 (41.83 MiB/s, 48.4% deduplicated)
109.0.5414.74 (32.04 MiB/s, 12.0% deduplicated)
109.0.5414.74 (44.61 MiB/s, 61.7% deduplicated)
109.0.5414.119 (38.26 MiB/s, 43.9% deduplicated)
110.0.5481.77 (32.41 MiB/s, 10.1% deduplicated)
110.0.5481.77 (71.22 MiB/s, 99.9% deduplicated)
110.0.5481.100 (41.75 MiB/s, 47.4% deduplicated)
110.0.5481.177 (40.16 MiB/s, 44.9% deduplicated)
110.0.5481.177 (65.85 MiB/s, 99.9% deduplicated)
110.0.5481.177 (62.85 MiB/s, 95.4% deduplicated)
111.0.5563.64 (32.72 MiB/s, 11.6% deduplicated)
111.0.5563.64 (64.05 MiB/s, 95.6% deduplicated)
111.0.5563.110 (39.95 MiB/s, 45.3% deduplicated)
111.0.5563.146 (41.69 MiB/s, 49.1% deduplicated)
111.0.5563.146 (68.72 MiB/s, 95.6% deduplicated)
112.0.5615.49 (33.32 MiB/s, 12.3% deduplicated)
112.0.5615.121 (41.98 MiB/s, 47.4% deduplicated)
112.0.5615.121 (63.99 MiB/s, 95.3% deduplicated)
112.0.5615.165 (39.84 MiB/s, 43.6% deduplicated)
113.0.5672.92 (34.99 MiB/s, 12.1% deduplicated)
113.0.5672.92 (67.90 MiB/s, 95.4% deduplicated)
113.0.5672.126 (42.18 MiB/s, 41.7% deduplicated)
113.0.5672.126 (68.86 MiB/s, 95.4% deduplicated)
Counting files in the deduplication repository:
# find ~/.local/share/attic/storage -type f | wc -l
99845
Summarising:
Thus deduplication is approximately 2.14x across these versions, at ~40 MB/s.
Redoing the same benchmark without --no-closure, I get a similar reduction factor of 2.19x:
# du -sh --total $(nix-store -qR $(cat chromium-nixos-22.11-store-paths.txt)) | tail -n1
18G total
# du -sh ~/.local/share/attic/storage
8.2G /root/.local/share/attic/storage
I found this bug while testing Attic this way: https://github.com/zhaofengli/attic/issues/61
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/introducing-attic-a-self-hostable-nix-binary-cache-server/24343/43
A fundamental issue with the way nix works is that updating a package with many dependencies will result in a mass rebuild, with subsequent cache.nixos.org storage cost, and for the users, mass download of many GB of packages.
This makes NixOS's data storage and transport requirements for updates much higher than for "mutable" Linux distributions (e.g. Debian) that can just ship a fix for an individual package. For example, a security fix to openssl.so might take 1 MB download on Debian, and 10 GB download on my NixOS system.
Block-based deduplication is a technique to split data into chunks, and to store chunks that appear in multiple files only once. Often, rolling hashes are used for this purpose; this is also how data transfer is avoided in rsync.
The ZFS file system has a deduplication feature, but in Zfs dedup on /nix/store – Is it worth it? it was stated that it is not very effective for the nix store.
However, there are other programs that do deduplication, such as bup, Borg, Attic, which seem to work pretty well in my first experiments (see next post).
This issue is to record measurements of effectiveness of deduplication for nix, and perhaps lead towards the implementation of deduplication to solve the fundamental issue.
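For readers new to the idea, block-based deduplication as described above can be sketched in a few lines of Python. This is a toy in-memory store with fixed-size chunks for brevity (tools like bup pick chunk boundaries with a rolling hash instead); the class and package names are made up.

```python
import hashlib
import os


class ChunkStore:
    """Content-addressed store: each distinct chunk is kept once, keyed by its hash."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # sha256 digest -> chunk bytes
        self.files = {}    # file name -> list of digests (the "recipe")

    def put(self, name, data):
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            self.chunks.setdefault(digest, chunk)  # stored only if not yet present
            recipe.append(digest)
        self.files[name] = recipe

    def get(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore()
common = os.urandom(16384)                 # 16 KiB shared by both "packages"
store.put("pkg-1.0", common + b"A" * 4096)
store.put("pkg-1.1", common + b"B" * 4096)

print(store.stored_bytes())                # 24576 instead of 40960
```

Both files remain fully reconstructible from their recipes, while the shared 16 KiB is stored only once.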