DataONEorg / hashstore

HashStore, a hash-based object store for DataONE data packages
Apache License 2.0

HashStoreClient - convert directories to HashStore directories #118

Closed doulikecookiedough closed 2 months ago

doulikecookiedough commented 3 months ago

In preparation for the Metacat 3.1.0 release, which will include HashStore, we will need a way to convert the existing /var/metacat/data and /var/metadata/documents directories into HashStore directories as part of the upgrade process. To keep the conversion efficient, we want to create symlinks to the files rather than move them.

The existing /data and /documents directories would remain where the real copies of the "old" files live, and new files/uploads would be stored directly in HashStore.

To Do:

mbjones commented 3 months ago

This sounds like a great idea. Please do use "hard" links (and not "symbolic" links), which means that each file will be stored once and its old location and its new location both point at the same inode -- all hard links are truly equivalent pointers to the file content. Once you have all of the files hard-linked into hashstore, you will be able to remove the old file links without any loss of data, and the new file links will remain. One consideration is whether (or how) CephFS supports hard links -- I am pretty sure it does (@jeanetteclark used them IIRC), but there may also be some (major?) efficiency gains in doing this as requests to the Ceph MDS API rather than as POSIX filesystem calls.
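A minimal sketch of the hard-link approach described above, using plain POSIX calls through Python's `os.link`. The function name `hard_link_into_store` and the directory layout (first two hex characters of the SHA-256 digest as a subdirectory) are assumptions made for illustration only; they are not HashStore's actual layout or API. Hard links also require the source and the store to be on the same filesystem.

```python
import hashlib
import os
from pathlib import Path


def hard_link_into_store(source_dir: Path, store_dir: Path) -> dict[str, Path]:
    """Hard-link every regular file under source_dir into store_dir.

    Hypothetical layout (an assumption, not HashStore's real one):
    store_dir/<first 2 hex chars of SHA-256>/<remaining hex chars>.
    Returns a mapping from each original path to its new link.
    """
    linked: dict[str, Path] = {}
    for path in sorted(source_dir.rglob("*")):
        # Skip directories and symlinks; only regular files are linked.
        if path.is_symlink() or not path.is_file():
            continue

        # Hash the content in chunks so large files are not loaded into memory.
        sha256 = hashlib.sha256()
        with path.open("rb") as stream:
            for chunk in iter(lambda: stream.read(1024 * 1024), b""):
                sha256.update(chunk)
        digest = sha256.hexdigest()

        target = store_dir / digest[:2] / digest[2:]
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.exists():
            # Both names now refer to the same inode; no data is copied.
            os.link(path, target)
        linked[str(path)] = target
    return linked
```

After a call such as `hard_link_into_store(Path("/var/metacat/data"), store_root)`, where `store_root` is a placeholder for a directory on the same filesystem, the old path and the new path share one inode, so removing the old path later leaves the content reachable through the new one.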

mbjones commented 3 months ago

I think this is the API for creating a hard link: https://docs.ceph.com/en/latest/cephfs/api/libcephfs-py/#cephfs.LibCephFS.link. There is also a method to create a symlink.

doulikecookiedough commented 3 months ago

Thank you @mbjones for your feedback, suggestion and link to docs! The "hard" links direction gives me some peace of mind (I was worried about what would happen to symlinks if the original data/metadata objects they point to were deleted). I will look into it and follow up if I have any other questions.
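The difference behind that concern can be checked with a small experiment. This is a sketch assuming a POSIX filesystem; the helper name `survives_removal` and the file names are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def survives_removal(make_link) -> bool:
    """Create a file, link to it with make_link(src, dst), delete the
    original name, and report whether the content is still readable
    through the link."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "original.xml"
        link = Path(tmp) / "link.xml"
        original.write_text("<metadata/>")

        make_link(original, link)
        original.unlink()  # remove the "old" name

        try:
            return link.read_text() == "<metadata/>"
        except FileNotFoundError:
            # A symlink whose target has been removed is left dangling.
            return False
```

Calling `survives_removal(os.link)` returns `True`, because the hard link is an equal name for the same inode, while `survives_removal(os.symlink)` returns `False`, because the symlink only stores the path of the deleted original.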

doulikecookiedough commented 3 months ago

After further discussion, the process or tool that converts existing Metacat /data and /documents directories into HashStore directories should be controlled by Metacat (and not by a hashstoreclient script or process). While HashStore can support this feature, Metacat should coordinate it. One or more new issues will be created in Metacat's repo after syncing up with the team.

To Discuss: Proposed Metacat HashStore Upgrade Process

1) Metacat will check postgres to see whether it is a fresh install or an upgrade, and suggest (force) an update

doulikecookiedough commented 2 months ago

Closing this issue - this utility method/process will not be part of the HashStore library. Metacat will fold this into its upgrade process (TBD).