Raku / problem-solving


Installed Raku distributions don't de-duplicate #388

Closed codesections closed 1 year ago

codesections commented 1 year ago

Consider a distribution Foo where Foo:ver<0.0.1> provides Foo::Bar and Foo::Baz, and Foo:ver<0.0.2> provides a modified Foo::Bar but an identical Foo::Baz. Given the way that CompUnit::Repositorys locate files, there should, in principle, be no reason why two identical copies of Foo::Baz need to be stored in the user's sources directory. Currently, however, identical copies are stored, which can result in significant space being occupied by duplicate files. Finding a way to avoid installing multiple copies of the same file might also allow Raku to skip recompiling some files, though I'm less sure of that because of potentially different compile-time environments.
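For a concrete picture of the duplication, a small audit along the following lines can group installed source files by content hash and report anything stored more than once. It assumes the default CompUnit::Repository::Installation layout where sources live under <prefix>/sources, uses the "site" repository, and relies on the Rakudo-specific nqp::sha1 op; all of that is an illustrative assumption, not something stated above.

```raku
# Illustrative audit sketch: group the installed source files of the
# (assumed) 'site' repository by content hash and report duplicates.
use nqp;

my $sources = CompUnit::RepositoryRegistry.repository-for-name('site')
    .prefix.add('sources');

my %by-content;
for $sources.dir.grep(*.f) -> $file {
    %by-content{ nqp::sha1($file.slurp) }.push: $file.basename;
}

for %by-content.kv -> $hash, @copies {
    say "$hash stored { +@copies } times: @copies[]" if @copies > 1;
}
```

Any hash reported more than once corresponds to identical content stored under several different file names.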

I noticed this when looking into #386, and was somewhat surprised. I had thought that Raku distributions would avoid installing identical files, as is done in the similar-in-spirit system used by Guix, which deduplicates at the file level, leading to significant space savings.

codesections commented 1 year ago

One solution to this would be for CompUnit::Repository to name files based on a content hash instead of a hash of their long name. This would prevent duplicate copies of the same file from being installed, since identical files would, by definition, have identical content hashes.

This would, however, complicate uninstalling a distribution since it would no longer be the case that every file that the distribution provides could be removed without impacting other distributions. To solve this issue, we would need to track the number of distributions that provide a particular file and only remove the file when that number reaches 0 (i.e., reference counting).
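A minimal sketch of such a content-addressed, reference-counted store, using made-up names and an in-memory refcount table (a real implementation would persist the counts alongside the dist metadata and hook into the actual CompUnit::Repository code):

```raku
# Sketch only: store each unique piece of content once, track how many
# distributions reference it, and delete it when the count hits zero.
use nqp;

class ContentStore {
    has IO::Path $.root is required;
    has %!refcount;    # content hash => number of distributions using it

    method add(Str $content --> Str) {
        my $hash = nqp::sha1($content);
        my $path = $!root.add($hash);
        $path.spurt($content) unless $path.e;    # at most one copy on disk
        %!refcount{$hash}++;
        $hash;
    }

    method remove(Str $hash) {
        return unless %!refcount{$hash};
        if --%!refcount{$hash} == 0 {            # last distribution gone
            $!root.add($hash).unlink;
            %!refcount{$hash}:delete;
        }
    }
}
```

Uninstalling a distribution would then call remove for each file it provides, and only the last distribution referencing a file would actually delete it from disk.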

Doing so would be fairly easy for Zef, but might raise more challenges for package managers such as apt. Still, I believe the space savings and speed gains would be worth it.

ugexe commented 1 year ago

I would guess the main issue would be performance. If we switch to hashing the entire contents of each file, then when CURI.id is called it'll have to hash a lot more data as well as open each of those files. Currently it basically just does a sha1 of [~] $dist-dir.dir, which hashes far less data (just file names, which are already sha1s) and uses far fewer file ops. Someone could probably do a benchmark to see what the performance difference is with ~100 distributions installed (it might not be as relevant these days), but they'll want to pay particular attention to the initial start-up time when using a module (a lot of the optimizations are specifically aimed at improving start-up time).
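As a rough starting point for such a benchmark, something like the following compares the two hashing strategies; it uses the "site" repository's sources directory as a stand-in for the data CURI.id actually touches, which is an assumption rather than the real code path:

```raku
# Rough benchmark sketch: hashing only file names vs. hashing the full
# contents of every installed source file.
use nqp;

my @sources = CompUnit::RepositoryRegistry.repository-for-name('site')
    .prefix.add('sources').dir.grep(*.f);

my $t0 = now;
nqp::sha1( [~] @sources.map(*.basename) );
say "names only:    { now - $t0 }s";

$t0 = now;
nqp::sha1( [~] @sources.map(*.slurp) );
say "full contents: { now - $t0 }s";
```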

edit: actually I'm not sure the CURI.id part is relevant. What would be relevant is that CURI looks up the file by the sha1 of the long name. When someone does use Foo, CURI essentially looks for sha1("Foo"). But if files were saved by their content hash, then use Foo would no longer have a way to find the file.
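One possible bridge, sketched here with entirely hypothetical names rather than the real CURI internals, is a small index from the long-name hash (what use Foo resolves through today) to the content hash the file is actually stored under:

```raku
# Hypothetical bridge between name-based lookup and content-addressed
# storage: an index mapping sha1(long name) to sha1(contents).
use nqp;

my %index;    # sha1(long name) => sha1(contents); persisted in practice

sub install-source(Str $long-name, Str $source, IO::Path $sources-dir) {
    my $content-hash = nqp::sha1($source);
    my $target       = $sources-dir.add($content-hash);
    $target.spurt($source) unless $target.e;    # at most one copy on disk
    %index{ nqp::sha1($long-name) } = $content-hash;
}

sub resolve(Str $long-name, IO::Path $sources-dir --> IO::Path) {
    my $content-hash = %index{ nqp::sha1($long-name) }
        // die "$long-name is not installed";
    $sources-dir.add($content-hash);
}
```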

ugexe commented 1 year ago

fwiw at one point I considered doing this as well, but came to feel that if someone really wanted file deduplication they could use e.g. FUSE or some other file-system mechanism (which would also do it better).

2colours commented 1 year ago

This is mostly an implementation question, invisible at the language level, right? When doing dependency resolution, the version comes only from the distribution, so for all intents and purposes the Foo::Bar of version 0.0.2 is an independent piece of code from the Foo::Bar of version 0.0.1. The user shouldn't notice the deduplication beyond less storage being used. In that case, I tend to agree with ugexe that it's best to do this in a language-agnostic way, with the least visible impact on the whole process.

niner commented 1 year ago

Honestly, if you are so starved for disk space that you worry about source code files, that calls for full file-system deduplication and compression, not a narrowly targeted solution like the one suggested.

That said, there is some merit to the idea of avoiding useless precompilation when we have already precompiled an identical file, and the immutability of installed distributions should allow us to do that. In practice, however, --force-install does exist, so that would open another can of problems.

I don't think we'd have to worry too much about performance as SHA was designed to be really fast. My desktop hashes some 890 MB/s on a single core. You won't notice it with source code files. That is actually the reason why I went ahead with CURFS just hashing the full lib dir back then. That only started to cause problems when people accidentally pointed RAKULIB at their home directories :)
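For anyone who wants to reproduce that kind of throughput figure from Raku itself, a quick check using the Rakudo-specific nqp::sha1 op might look like this (the 100 MB of dummy data is an arbitrary choice, and results will differ from a command-line sha1sum):

```raku
# Quick, rough hashing-throughput check.
use nqp;

my $data = 'x' x 100_000_000;    # roughly 100 MB of ASCII
my $t0   = now;
nqp::sha1($data);
say "{ 100 / (now - $t0) } MB/s";
```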

I'm sure we could do this and gain some benefit in a specific situation. I highly doubt whether that's the best use of our very limited resources though.

codesections commented 1 year ago

Thanks for the replies. Based on this feedback, this seems like a low priority; given that, I'm going to close this issue for now. If anyone would like to champion a solution here, please feel free to re-open.