Closed zhubonan closed 1 year ago
Proof of concept PR #126
Pinging @giovannipizzi @chrisjsewell
After discussion with @zhubonan and @chrisjsewell the following design could be envisaged:
- A new folder, `archived-packs`.
- A new table, `ArchivedPacks`, that has just two columns, the `pack_id` and the `location` (there should be a unique constraint on the `pack_id` column): the existence of a pack id in this column means the pack should not be looked for in the `packs` subfolder, but in the `location` subfolder, which by default is `archived-packs`.
- Functions (and `dostore` cmdline commands) to move a pack to the archived directory, possibly with a custom name, checking that this does not overlap with known names like `sandbox` and `loose` (or, each folder should be inside `archived-packs/<LOCATION>`, where `<LOCATION>` is the value of the `location` column). This would take care of moving the pack in a way that is aware that the destination might be on a different file system: e.g. first check that the pack is sealed (see issue #124; we should define the concept of a "sealed" pack, only move sealed packs, and disallow adding to such a pack afterwards); then copy it over; then (after checking the MD5 to ensure the pack was successfully copied?) add the entry to the `ArchivedPacks` table; then (maybe as a maintenance operation) remove the pack from `packs` and only keep the archived version.
- [...] `packs/` folder.
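The copy, verify, record sequence described above could be sketched roughly as follows. This is only an illustration, not code from the package: the function names `file_md5` and `archive_pack`, the table name `archived_packs`, and the default location `"default"` are all assumptions, and the "sealed" check is left out because that concept is not defined yet (issue #124).

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_pack(container: Path, db: sqlite3.Connection,
                 pack_id: int, location: str = "default") -> Path:
    """Copy a pack to archived-packs/<location>, verify it, then record it."""
    # Hypothetical schema: one row per archived pack, unique on pack_id.
    db.execute(
        "CREATE TABLE IF NOT EXISTS archived_packs ("
        "pack_id INTEGER NOT NULL UNIQUE, location TEXT NOT NULL)"
    )
    source = container / "packs" / str(pack_id)
    target_dir = container / "archived-packs" / location
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / str(pack_id)

    # A copy (rather than a rename) also works across file systems.
    shutil.copy2(source, target)
    if file_md5(source) != file_md5(target):
        target.unlink()
        raise OSError(f"checksum mismatch while archiving pack {pack_id}")

    # Record the pack only after the copy has been verified.
    with db:
        db.execute(
            "INSERT INTO archived_packs (pack_id, location) VALUES (?, ?)",
            (pack_id, location),
        )
    return target
```

The original file in `packs` is deliberately left in place here, so that its removal can happen later as a separate maintenance step.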
As a power user, I can then create folders inside `archived-packs` and mount them from some remote location. In this way, archiving will allow moving big data to other locations.
In addition, there should be a function to check that all packs are actually there (e.g. to catch the case where one of the archived folders is not mounted; ideally it would also verify the checksum for further validation?). The simple check of file existence should hopefully be fast, and could be done every time a new container instance is created, throwing an exception if a pack is missing?
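A minimal sketch of such an existence check, assuming the archived packs are known as a mapping from pack id to location name (for instance read from the `ArchivedPacks` table). The function names and the choice of `FileNotFoundError` are illustrative only.

```python
from pathlib import Path
from typing import Dict, List


def missing_archived_packs(container: Path, archived: Dict[int, str]) -> List[int]:
    """Return the ids of archived packs whose file cannot be found."""
    missing = []
    for pack_id, location in sorted(archived.items()):
        pack_path = container / "archived-packs" / location / str(pack_id)
        if not pack_path.is_file():
            missing.append(pack_id)
    return missing


def check_archived_packs(container: Path, archived: Dict[int, str]) -> None:
    """Raise if any archived pack is missing, e.g. an unmounted folder."""
    missing = missing_archived_packs(container, archived)
    if missing:
        raise FileNotFoundError(
            f"archived packs not found (folder not mounted?): {missing}"
        )
```

This only tests for file existence, which keeps it cheap enough to run at container creation; checksum validation would be a separate, slower operation.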
Finally, it should be easy for the user to archive the packs. E.g. one could have a command `dostore archive-packs --keep-last=2 [--location=nfs]`, where `--location` might be optional and we might have a default location like `archive`; the command will take all unsealed archives, keep the last 2 in the `packs/` folder, and "move" all the rest to the `archived-packs` folder as described above.
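The selection behind a `--keep-last` option could look like the sketch below. It assumes that "last" means the packs with the highest integer ids; the function name `packs_to_archive` is hypothetical.

```python
from typing import Iterable, List


def packs_to_archive(pack_ids: Iterable[int], keep_last: int = 2) -> List[int]:
    """Return the pack ids to move, leaving the `keep_last` highest ids in place."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    ordered = sorted(pack_ids)
    # Everything except the final `keep_last` entries is a candidate for archiving.
    return ordered[: max(len(ordered) - keep_last, 0)]
```

For example, with packs `0` to `4` and `keep_last=2`, packs `0`, `1` and `2` would be selected and packs `3` and `4` would stay in `packs/`.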
@giovannipizzi Thanks for the summary!
One potential issue I can think of with this is that if the user has multiple profiles and hence multiple repositories, they can potentially make mistakes when mounting the correct folder inside `archived-packs` to the right disk-objectstore container. If such a mistake is made, my impression is that the current implementation would return an incorrect stream?
At the moment the packs are stored as numbered files, e.g. `1`, `2`, `3`. Would it make sense to add some kind of identifier to the pack file names, such as `1_<uuid-of-container>`, to avoid potential errors?
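To make the naming idea concrete, here is a small sketch of building and validating such file names. Both helper names are hypothetical, and the container identifier is treated as an opaque string.

```python
def pack_file_name(pack_id: int, container_id: str) -> str:
    """Build a pack file name of the form <pack_id>_<container_id>."""
    return f"{pack_id}_{container_id}"


def parse_pack_file_name(name: str, expected_container_id: str) -> int:
    """Return the pack id, refusing files that belong to another container."""
    pack_id, separator, container_id = name.partition("_")
    if not separator or container_id != expected_container_id:
        raise ValueError(f"pack file {name!r} does not belong to this container")
    return int(pack_id)
```

With this scheme, a folder mounted into the wrong container would be rejected on lookup instead of silently returning the wrong data.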
good point, thanks! Either that, or have a JSON in the folder that gives information. But I agree
After re-discussion with @zhubonan we realized that the logic described here is probably too complex. Probably the easiest is to mount just the `packs` subfolder in a different location. This should typically be sufficient for most use cases. I will therefore close this as a wontfix.
One problem I face with my current AiiDA-based workflow is the growing size of the repository versus the finite size of the fast SSD storage. This can happen quite quickly if I have to run a few "large" calculations for which a lot of data is needed during post-processing and is provenance-critical. In theory, most of the files stored by AiiDA are not frequently accessed and they are perfectly fine to sit on a slow storage location, e.g. spinning disk or NFS mounts. On the other hand, having the whole repository on a slow storage location can slow down the daemon and workflows.
I think this package can give a natural solution to this problem. Here, the loose "objects" can be written onto a fast-to-write disk. Read-only access to the "fully" packed packs no longer benefits from fast disk speed, so they can be moved into slow storage if needed.
At the moment, all of the (integer-numbered) packs are stored under the `packs` folder. Would it be possible to allow multiple storage locations to be used (for fully "packed" ones)? I think it should just be a matter of iterating over the storage locations and checking if the file exists, or a dictionary of pack ids and their locations can be built when the `Container` class is instantiated to reduce the overhead.
Please let me know what you think about this idea. Thanks!
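As a rough illustration of the dictionary approach, the sketch below scans a list of folders once and maps each integer pack id to the file found for it. The function name `build_pack_index` and the rule that earlier locations take precedence are assumptions, not existing behaviour of the package.

```python
from pathlib import Path
from typing import Dict, Iterable


def build_pack_index(locations: Iterable[Path]) -> Dict[int, Path]:
    """Map each integer pack id to its file, with earlier locations winning."""
    index: Dict[int, Path] = {}
    for folder in locations:
        if not folder.is_dir():
            continue  # e.g. a slow disk that is not mounted
        for entry in folder.iterdir():
            # Packs are stored as files whose name is a plain integer.
            if entry.is_file() and entry.name.isascii() and entry.name.isdigit():
                index.setdefault(int(entry.name), entry)
    return index
```

Building the index once at instantiation means later lookups are a dictionary access rather than one file-existence check per location.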