One solution to this would be for `CompUnit::Repository` to name files based on a content hash instead of a hash of their long name. This would prevent duplicate copies of the same file from being installed, since identical files would, by definition, have identical content hashes.
This would, however, complicate uninstalling a distribution since it would no longer be the case that every file that the distribution provides could be removed without impacting other distributions. To solve this issue, we would need to track the number of distributions that provide a particular file and only remove the file when that number reaches 0 (i.e., reference counting).
Doing so would be fairly easy for Zef, but might raise more challenges for package managers such as apt. But I believe that the space savings and speed gains would be worth it.
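To make the reference-counting idea above concrete, here is a minimal sketch, assuming a content-addressed `sources` directory and an installer-maintained count; the helper names are made up for illustration, and none of this is Zef's or CURI's actual code:

```raku
# Hypothetical sketch only: a content-hash => count table kept by the installer.
my %ref-count;

sub register-file(Str $content-hash) {
    # called once per installed distribution that provides this exact file content
    %ref-count{$content-hash}++;
}

sub release-file(Str $content-hash, IO::Path $sources-dir) {
    # called during uninstall; the file is removed only when no
    # remaining distribution provides it
    return unless %ref-count{$content-hash};
    %ref-count{$content-hash}--;
    if %ref-count{$content-hash} == 0 {
        %ref-count{$content-hash}:delete;
        $sources-dir.add($content-hash).unlink;
    }
}
```

In practice the count would have to live in the repository's on-disk metadata rather than in memory, so that separate install and uninstall runs see the same numbers.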
I would guess the main issue would be performance. If we switch to hashing the entire contents of each file, then when `CURI.id` is called it will have to hash a lot more data as well as open each of those files. Currently it basically just does a sha1 of `[~] $dist-dir.dir`, which hashes far less data (just file names, which are themselves already sha1s) and uses far fewer file operations. Someone could benchmark the performance difference with, say, 100 distributions installed (it might not be as relevant these days), but they'd want to pay particular attention to the initial start-up time when using a module, since a lot of the existing optimizations are aimed specifically at improving start-up time.
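As a rough sketch of the trade-off being described (this is not the real `CompUnit::Repository::Installation` code, and it assumes a `sha1(Str --> Str)` helper from an ecosystem Digest module, since core Raku doesn't export one): an id built from file names only has to read directory entries, while an id built from file contents has to open and read every file.

```raku
# Assumed helper: sub sha1(Str $data --> Str) from a Digest module (not core Raku).

sub id-from-names(IO::Path $dist-dir --> Str) {
    # roughly the current approach: hash the concatenated file names,
    # which are short strings that are themselves already sha1s
    sha1([~] $dist-dir.dir».basename.sort);
}

sub id-from-contents(IO::Path $dist-dir --> Str) {
    # the content-hash approach: every file must be opened and read in full
    sha1([~] $dist-dir.dir.sort(*.basename)».slurp);
}
```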
edit: actually I'm not sure the `CURI.id` part is relevant. What is relevant is that `CURI` looks up each file by the sha1 of its long name. When someone does `use Foo`, `CURI` essentially looks for `sha1("Foo")`. But if files were saved by their content hash, then `use Foo` would no longer have a way to find the file.
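One way to see what would break, and what an installer would have to add to compensate, is the following hypothetical sketch (again assuming a `sha1` helper; `locate-by-name` and `locate-by-content` are made-up names, not real CURI methods):

```raku
# Assumed helper: sub sha1(Str $data --> Str) from a Digest module.

# Installer-maintained index: long name => content hash.
my %content-hash-for;

# Current scheme: the on-disk file name is derivable from the long name alone.
sub locate-by-name(Str $long-name, IO::Path $sources-dir --> IO::Path) {
    $sources-dir.add(sha1($long-name));
}

# Content-addressed scheme: an extra lookup, written at install time,
# is needed to get from the long name to the stored file.
sub locate-by-content(Str $long-name, IO::Path $sources-dir --> IO::Path) {
    $sources-dir.add(%content-hash-for{$long-name});
}
```

In other words, content addressing wouldn't make `use Foo` impossible, but it would replace today's purely computed file name with an index that has to be written at install time and kept in sync on uninstall.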
fwiw at one point I considered doing this as well, but came to feel that if someone really wanted file deduplication they could use e.g. FUSE or some other file-system mechanism (which would also do it better).
This is mostly an implementation question, invisible at the language level, right? When doing dependency resolution, the version comes only from the distribution, so for all intents and purposes the `Foo::Bar` of version 0.0.2 is an independent piece of code from the `Foo::Bar` of version 0.0.1. The user shouldn't notice the deduplication, apart from less storage being used. In which case, I tend to agree with ugexe that it's best to do this in a language-agnostic way, with the least visible impact on the whole process.
Honestly, if you are so starved for disk space that you worry about source code files, that calls for full file-system deduplication and compression, not a very targeted solution like the suggested one.
That said, there is some merit to the idea of avoiding useless precompilation when we have already precompiled an identical file, and the immutability of installed distributions should allow us to do that. However, in practice `--force-install` does exist, so that would open another can of worms.
I don't think we'd have to worry too much about performance, as SHA was designed to be really fast. My desktop hashes some 890 MB/s on a single core. You won't notice it with source code files. That is actually the reason why I went ahead with `CURFS` just hashing the full lib dir back then. That only started to cause problems when people accidentally pointed `RAKULIB` at their home directories :)
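For anyone who wants to sanity-check that kind of throughput number on their own machine, a rough single-shot measurement might look like the sketch below (it assumes a `sha1` routine that accepts a `Blob`, e.g. from an ecosystem Digest module; nothing here is from the thread, and real benchmarking would need warm-up and multiple runs):

```raku
# Hash ~100 MB of bytes once and report the elapsed time.
my $mb    = 100;
my $data  = ('x' x ($mb * 1024 * 1024)).encode;
my $start = now;
sha1($data);
say "hashed $mb MB in {now - $start} seconds";
```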
I'm sure we could do this and gain some benefit in a specific situation. I highly doubt that it's the best use of our very limited resources, though.
Thanks for the replies. Based on this feedback, this seems like a low priority; given that, I'm going to close this issue for now. If anyone would like to champion a solution here, please feel free to re-open.
Consider a distribution Foo where `Foo:ver<0.0.1>` provides `Foo::Bar` and `Foo::Baz`, and `Foo:ver<0.0.2>` provides a modified `Foo::Bar` but an identical `Foo::Baz`. Given the way that `CompUnit::Repository`s locate files, there should in principle be no reason why two identical copies of `Foo::Baz` need to be stored in the user's `sources` directory. Currently, however, identical copies are stored, which can result in significant space being occupied by duplicate files. Finding a way to avoid installing multiple copies of the same file might also allow Raku to skip recompiling some files, though I'm less sure of this due to potentially-different compile-time environments.

I noticed this when looking into #386, and was somewhat surprised. I had thought that Raku distributions would avoid installing identical files, as is done in the similar-in-spirit system used by Guix, which deduplicates at the file level, leading to significant space savings.
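To make the duplication concrete, here is a hypothetical helper (not part of the issue; it assumes a `sha1` routine that accepts a `Blob` and a path to a repository's `sources` directory) that counts byte-for-byte duplicate files:

```raku
# Hypothetical duplicate-finder over a CURI sources directory.
sub duplicate-report(IO::Path $sources-dir) {
    # group installed source files by a hash of their raw bytes
    my %by-hash = $sources-dir.dir.grep(*.f).classify({ sha1(.slurp(:bin)) });
    my @dupes   = %by-hash.values.grep(*.elems > 1);
    my $wasted  = @dupes.map({ .elems - 1 }).sum;
    say "$wasted redundant copies across {@dupes.elems} duplicated files";
}
```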