hexylena opened this issue 1 year ago
Related, but maybe not the right issue: we could get a big win from using the precomputed hashes of input sources that have them, like S3. E.g., if importing a file from a bucket using the S3 file source plugin, four hashes are precomputed (including CRC32 and SHA-256), so we could skip that entire download if the data are already imported.
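A minimal sketch of what I mean, using boto3 (the bucket and key here are made up, and the `Checksum*` fields only come back if the object was uploaded with checksums enabled):

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 for the checksums it already holds for the object; nothing is downloaded.
# ChecksumMode="ENABLED" is required for the Checksum* fields to be returned.
head = s3.head_object(
    Bucket="example-bucket",        # hypothetical bucket
    Key="inputs/reads.fastq.gz",    # hypothetical key
    ChecksumMode="ENABLED",
)

precomputed_sha256 = head.get("ChecksumSHA256")  # base64-encoded; for multipart
                                                 # uploads it's a checksum-of-parts
if precomputed_sha256 is not None:
    # If this hash already matches a dataset_hash record, the download
    # could be skipped entirely and the existing dataset reused.
    print("S3 already knows the SHA-256:", precomputed_sha256)
```

One caveat: for multipart uploads the composite checksum isn't directly comparable to a whole-file SHA-256, so we'd probably want to record which flavour of hash we stored.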
ah that's fantastic. another great win. (same for TUS, we can get hashes from that too, right?)
I think when we discussed this a few weeks ago at a Backend WG meeting we didn't reach a conclusion, but there were concerns about having to calculate hashes for big files as part of a dataset upload/creation. It was also suggested that this may be better left to the file system layer (if using one that supports it). For datasets that are tool outputs, I think the job cache is the way to go, but for uploads where the hash is calculated anyway (like the S3 file source plugin and/or TUS mentioned above), that would indeed be a nice feature.
Not sure what all was discussed in that WG, but as an admin I think I'm fine with my users waiting even an extra 30 minutes to get a checksum on the large file they waited 5 hours to upload, given what that buys me without having to rebuild my infra on a checksumming/deduplicating filesystem.
> I think the job cache is the way to go,
yeah, agreed, but I guess that will maybe never support cross-user account "deduplication", right? Even in a perfect world of identical hashes, we'd probably still limit it to per user just to prevent unintentional data leaks right?
We don't even have to wait - unchecksummed data can go into a temporary "unhashed" object store and then be moved once the checksum/hash is complete and nothing is using it as an input.
I still think this would be an enormous win for the big servers.
Agreed. And yeah, that would make migrating to a CAS easier, if things could be renamed once hashed.
Just to put some numbers on this, I wrote a script to find all duplicate data in a directory so we could get some kind of idea as to how much space we would save with it. Unfortunately, the script got interrupted at around 19% complete (and it's just a big `find`, so there's no way to resume) while walking the ~1.75 PB corral4 object store backend on .org. At the point it died, it was estimating we'd save about 133 TB from deduplication. That number was going up every time I recalculated.
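For the curious, the general shape of that kind of scan (not the actual script; this sketch groups files by size first so only candidate duplicates get hashed, which would also make a re-run much cheaper):

```python
import hashlib
import os
import sys
from collections import defaultdict

def sha256_of(path, bufsize=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def estimate_savings(root):
    # Group files by size first; a unique size can't be a duplicate.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue
    saved = 0
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha256_of(path)].append(path)
        for dupes in by_hash.values():
            # Every copy beyond the first would be reclaimable under a CAS.
            saved += size * (len(dupes) - 1)
    return saved

if __name__ == "__main__":
    print(f"~{estimate_savings(sys.argv[1]) / 1e12:.1f} TB reclaimable")
```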
Just to check my understanding: the lower bound is 133 TB (saved) / 1750 TB (assuming the full object store had been processed), so already ~8% saved by moving to a CAS? Knowing it'd be higher, since we didn't scan the full 1750 TB?
That's correct, yes.
Crikey, that's a massive number.
Unpopular opinion: if it's not at least a 50% saving, it's not worth the headache? We wouldn't do something like this for speedups if it made the architecture harder.
(but I am all for calculating hashes for new data / if the object store supports it)
That might indeed be an unpopular opinion, especially among folks paying for storage :laughing:
Your concern is that it makes the architecture harder? Could you elaborate on that? It would be really interesting (as an admin) to hear what you think it would take / how else it would impact the system!
> Could you elaborate on that?
We create the datasets up front, so we don't have the hash. We can't rely on the object store hash if we don't have it, so it's effectively something that needs to be coordinated out of band ... which is maybe what you'd want to do? Use Nate's script and create e.g. hardlinks?
Per our offline conversations about this, the main hurdle is the timing of moving that data to its CAS path in a manner that doesn't break anything. You can mostly safely do it if it's not a job input, although we don't have any locking to avoid race conditions. It would also be an issue with downloads, where we really don't know whether the file is in use.
The issues aren't insurmountable but might be easier to solve in my head than in reality.
EDIT: That said, if we want to calculate a hash for everything anyway, then there is even less reason not to do this.
> You can mostly safely do it if it's not a job input, although we don't have any locking to avoid race conditions
It's a similar problem to cleaning up the object store cache. IIRC someone had a WIP PR that was at least excluding active jobs.
thanks for the elaborations! appreciate it
It could even be as simple as "hardlink the hash path, which will be used for all subsequent inputs/downloads, and then remove the uuid link after a month out of band." How often is that going to cause problems?
Actually... in theory we don't even need to remove the UUID path if it's a hard link.
> Actually... in theory we don't even need to remove the UUID path if it's a hard link.
I mean yeah, that's really what we want: the backing store is CAS, with a front end that looks like an object store and makes the usual UUID-named files as hardlinks to the CAS proper. Just a process of `cp` and then swapping that (atomically) with a hard link.
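Roughly, as a sketch (paths and helper names are made up, it assumes the UUID files and the CAS root share a filesystem so hard links work, and there's still no locking here, per the caveat above):

```python
import hashlib
import os

def cas_path(cas_root, digest):
    # Shard the hash into subdirectories, much like the existing UUID layout.
    return os.path.join(cas_root, digest[:2], digest[2:4], f"{digest}.dat")

def dedupe_into_cas(uuid_path, cas_root):
    """Hash an existing UUID-named dataset, link it into the CAS, then
    atomically repoint the UUID path at the CAS copy."""
    h = hashlib.sha256()
    with open(uuid_path, "rb") as f:
        while chunk := f.read(1 << 20):
            h.update(chunk)
    target = cas_path(cas_root, h.hexdigest())
    os.makedirs(os.path.dirname(target), exist_ok=True)

    if not os.path.exists(target):
        # First copy of this content: the UUID file itself becomes the CAS entry.
        os.link(uuid_path, target)

    # Atomically swap the UUID path for a hard link to the CAS entry, so anything
    # reading the UUID path keeps working. If the content was already in the CAS,
    # this is the moment the duplicate bytes become reclaimable.
    tmp = uuid_path + ".cas-tmp"
    os.link(target, tmp)
    os.replace(tmp, uuid_path)
```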
I was told of some issues the IRIDA folks faced with "how to manage large datasets being repeatedly copied into Galaxy by different users from an external system, to analyse in Galaxy", and just want to note that use case here: it would be completely solved by a CAS; they could repeatedly upload (or maybe not even upload, if hashing is implemented in a nice way / they could provide dataset hashes) and not worry about the storage usage on the Galaxy side.
But they're moving to Nextflow for the next version, so maybe it isn't relevant anymore.
We've already got the infrastructure for storing datasets by UUID (and making the associated subdirectories). If, instead of UUID, we stored by sha256sum, we could have instant space de-duplication.
It'd be neat to add this as a backend option for the Object Store that one could choose to use for storage. Before datasets went to this backend they'd get hashed (SHA-256? something else? multiple hashes combined?) and stored at a path based on that hash, just like how the UUID backend works. It could even leverage the dataset_hash table (though I'm not sure how that gets populated? Is there a flag I can enable somewhere?)
It wouldn't need to be particularly smart for a first pass (e.g. rejecting a file before transfer if the hash matched something already inside); it could just accept the file, internally decide it was already stored, and update the reference to it.
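The write path for that first pass could be as dumb as this (a toy sketch only, not the actual Galaxy ObjectStore API; the class and method names are hypothetical):

```python
import hashlib
import os
import shutil

class HashedDiskStore:
    """Toy hash-addressed backend: files live at a path derived from their
    SHA-256, so identical datasets collapse onto one copy on disk."""

    def __init__(self, root):
        self.root = root

    def _path_for(self, digest):
        return os.path.join(self.root, digest[:2], digest[2:4], f"{digest}.dat")

    def store(self, staged_path):
        # Hash the already-transferred file; no need to reject before transfer.
        h = hashlib.sha256()
        with open(staged_path, "rb") as f:
            while chunk := f.read(1 << 20):
                h.update(chunk)
        digest = h.hexdigest()
        dest = self._path_for(digest)
        if os.path.exists(dest):
            os.unlink(staged_path)  # already stored: just drop the new copy
        else:
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.move(staged_path, dest)
        # The caller records `digest` (e.g. in dataset_hash) and points the
        # dataset at `dest`; that's the "just update the reference" part.
        return digest, dest
```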
(I'd suggest IPFS, but the performance numbers I've seen are staggeringly abysmal, and all I really want is the CAS portion.)
This issue arises because I have a user that's re-run Trim Galore! across the same dataset multiple times (it's part of the workflow!), which generates giant files that eat through our storage, and I end up in this situation where all of these outputs are bit-for-bit identical.
If they'd been stored by hash, we still would have wasted the compute time, but we wouldn't have wasted the storage space (more precious in this scenario). To avoid wasting the compute time as well, we'd need improvements from https://github.com/galaxyproject/galaxy/issues/6887