aiidateam / disk-objectstore

An implementation of an efficient "object store" (actually, a key-value store) writing files on disk and not requiring a running server
https://disk-objectstore.readthedocs.io
MIT License

allowing multiple pack storage locations #123

Closed: zhubonan closed this issue 1 year ago

zhubonan commented 2 years ago

One problem I face with my current AiiDA-based workflow is the growing size of the repository versus the finite size of the fast SSD storage. This can happen quite quickly if I have to run a few "large" calculations for which a lot of data is needed during post-processing and is provenance-critical. In theory, most of the files stored by AiiDA are not frequently accessed and are perfectly fine sitting on slow storage, e.g. a spinning disk or an NFS mount. On the other hand, having the whole repository on a slow storage location can slow down the daemon and workflows.

I think this package can give a natural solution to this problem. Here, the loose "objects" can be written to a fast-to-write disk. Read-only access to the "fully" packed packs no longer benefits from fast disk speed, so they can be moved to slower storage if needed.

At the moment, all of the (integer-numbered) packs are stored under the packs folder. Would it be possible to allow multiple storage locations to be used (for the fully "packed" ones)? I think it should just be a matter of iterating over the storage locations and checking whether the file exists; alternatively, a dictionary of pack ids and their locations could be built when the Container class is instantiated to reduce the overhead.
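For illustration, here is a minimal sketch of what such a lookup could look like (the class and method names are hypothetical, not part of the current disk-objectstore API; it assumes packs remain plain integer-named files as today):

```python
from pathlib import Path


class MultiLocationPackResolver:
    """Resolve a pack id to its file across several storage locations.

    A sketch only: the index is built once at instantiation so that
    reads do not pay an existence check per access.
    """

    def __init__(self, pack_locations):
        # pack_locations: directories that may contain pack files, searched
        # in order (fast storage first, archive locations afterwards).
        self._index = {}
        for location in map(Path, pack_locations):
            for pack_file in location.iterdir():
                if pack_file.name.isdigit():  # packs are integer-named files
                    # setdefault: the first location found wins, so the fast
                    # location takes precedence over archived copies.
                    self._index.setdefault(int(pack_file.name), pack_file)

    def get_pack_path(self, pack_id: int) -> Path:
        try:
            return self._index[pack_id]
        except KeyError:
            raise FileNotFoundError(f"pack {pack_id} not found in any location")
```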

Please let me know what you think about this idea. Thanks!

zhubonan commented 2 years ago

Proof of concept PR #126

Pinging @giovannipizzi @chrisjsewell

giovannipizzi commented 2 years ago

After discussion with @zhubonan and @chrisjsewell, the following design could be envisaged:

As a power user, I can then create folders inside archived-packs and mount them from some remote location. In this way, archiving will make it possible to move big data to other locations.

In addition, there should be a function to check that all packs are actually there (e.g. to catch the case where one of the archived folders is not mounted; ideally the checksum could also be stored for further validation). The simple check of file existence should hopefully be fast, and should be done every time a new container instance is created; if it fails, an exception should be raised.
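A hedged sketch of what such a startup check could look like (the function and its arguments are invented for illustration; checksum verification is kept optional since hashing large packs is expensive):

```python
import hashlib
from pathlib import Path


def validate_packs(expected: dict, verify_checksums: bool = False) -> None:
    """Check all expected packs are present, e.g. at container instantiation.

    `expected` maps each pack path to its recorded SHA-256 hex digest
    (or None if no checksum was stored). Raises if a pack is missing,
    e.g. because one of the archived folders is not mounted.
    """
    for path, digest in expected.items():
        path = Path(path)
        if not path.is_file():
            raise FileNotFoundError(f"pack missing (archive not mounted?): {path}")
        if verify_checksums and digest is not None:
            hasher = hashlib.sha256()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(2**20), b""):
                    hasher.update(chunk)
            if hasher.hexdigest() != digest:
                raise ValueError(f"checksum mismatch for pack: {path}")
```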

Finally, it should be easy for the user to archive the packs. E.g. one could have a command dostore archive-packs --keep-last=2 [--location=nfs], where --location might be optional and we might have a default location like archive; the command would keep the last 2 packs (which may still be unsealed) in the packs/ folder and "move" all the rest to the archived-packs folder as described above.
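The core of such a command could look roughly like the sketch below (dostore has no archive-packs subcommand today; the sealing logic is glossed over and the archived-packs layout is assumed from the discussion above):

```python
import shutil
from pathlib import Path


def archive_packs(container_root: str, keep_last: int = 2, location: str = "archive") -> None:
    """Move all but the newest `keep_last` packs into archived-packs/<location>/."""
    packs_dir = Path(container_root) / "packs"
    target_dir = Path(container_root) / "archived-packs" / location
    target_dir.mkdir(parents=True, exist_ok=True)

    # Packs are integer-named files; sort so the newest ids come last.
    pack_ids = sorted(int(p.name) for p in packs_dir.iterdir() if p.name.isdigit())
    to_move = pack_ids[:-keep_last] if keep_last > 0 else pack_ids
    for pack_id in to_move:
        # shutil.move copies across filesystems too, e.g. onto an NFS mount.
        shutil.move(str(packs_dir / str(pack_id)), str(target_dir / str(pack_id)))
```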

zhubonan commented 2 years ago

@giovannipizzi Thanks for the summary!

One potential issue I can think of is that if the user has multiple profiles, and hence multiple repositories, one can potentially make mistakes when mounting folders inside archived-packs, attaching them to the wrong disk-objectstore container. If such a mistake is made, my impression is that the current implementation would return an incorrect stream?

At the moment the packs are stored as numbered files, e.g. 1, 2, 3. Would it make sense to add some kind of identifier to the pack file names, such as 1_<uuid-of-container>, to avoid potential errors?
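For example, a check along these lines could reject foreign packs early (the naming scheme is the one proposed above, not what disk-objectstore does today):

```python
import uuid
from pathlib import Path


def parse_and_check_pack_name(pack_path: Path, container_uuid: uuid.UUID) -> int:
    """Parse a '1_<uuid-of-container>' style file name and verify ownership."""
    pack_id, _, owner = pack_path.name.partition("_")
    if uuid.UUID(owner) != container_uuid:
        raise ValueError(
            f"pack {pack_path} belongs to container {owner}, not {container_uuid}"
        )
    return int(pack_id)
```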

giovannipizzi commented 2 years ago

Good point, thanks! Either that, or have a JSON file in the folder that provides this information. But I agree a safeguard is needed.
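Something along these lines, for instance (the file name and keys are invented for the sketch):

```python
import json
from pathlib import Path


def verify_archive_folder(folder: Path, container_uuid: str) -> None:
    """Refuse to read packs from a folder written by a different container."""
    metadata = json.loads((folder / "metadata.json").read_text())
    if metadata["container_uuid"] != container_uuid:
        raise ValueError(
            f"{folder} was written by container {metadata['container_uuid']}, "
            f"expected {container_uuid}"
        )
```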

giovannipizzi commented 1 year ago

After re-discussing with @zhubonan, we realized that the logic described here is probably too complex. The easiest approach is probably to just mount the packs subfolder in a different location; this should be sufficient for most use cases. I will therefore close this as a wontfix.