dennwc / cas

Content Addressable Storage
Apache License 2.0

Resolving the path of the file from the cas hash? #3

Open photocyte opened 1 year ago

photocyte commented 1 year ago

Hi there,

Is it possible to resolve the full path of the file from the cas hash? (i.e. analogous to cas blob, but returning the local filepath instead).

I'm imagining a use case where I could keep better track of large files that are identical on both remote and local storage, but that might have distinct paths or be moving around.

Feel free to say if this is a misunderstanding of how content addressable storage can/should work.

dennwc commented 1 year ago

At least at the moment, CAS relies heavily on the filesystem. As you mentioned in another issue, it uses xattrs to store hashes. Filesystems usually don't provide a way to quickly find files by xattr, so some other mechanism would have to be implemented for such a lookup.
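To illustrate why this lookup is slow without extra machinery, here is a minimal sketch of the brute-force approach: walk a directory tree and compare a hash stored in an xattr on each file. The attribute name `user.example.sha256` and the function names are hypothetical, not what CAS actually uses, and `os.getxattr` is only available on Linux.

```python
import os

# Hypothetical attribute name for illustration; not the name CAS really uses.
XATTR_NAME = "user.example.sha256"


def read_hash_xattr(path):
    """Return the hash stored in the file's xattr, or None if unavailable."""
    try:
        return os.getxattr(path, XATTR_NAME).decode("ascii", errors="replace")
    except (OSError, AttributeError):
        # OSError: attribute missing or xattrs unsupported on this filesystem.
        # AttributeError: os.getxattr does not exist on this platform.
        return None


def find_by_hash(root, wanted, read_hash=read_hash_xattr):
    """Linear scan: visit every file under root and compare its stored hash."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if read_hash(path) == wanted:
                matches.append(path)
    return sorted(matches)
```

The cost grows with the number of files under `root`, which is why some separate index would be needed for fast lookups. The `read_hash` parameter is only there so the scan can be tried out on filesystems without xattr support.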

If I understood your use case correctly, you want to index the whole folder but keep the files at their original location. Then, if a file is moved, you want CAS to know its new location, correct?

I think it should be possible: CAS could use hardlinks to copy files ("index" them in its own directory) and then use some filesystem API to find all the other files linked to the same underlying data. This sounds like a good idea, but it needs some research on my side.
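A rough sketch of the hardlink idea, under some assumptions: the index directory lives on the same filesystem as the data (hardlinks cannot cross filesystems), and since I'm not aware of a portable API that lists all links to an inode, the lookup below falls back to scanning a directory for files with the same device and inode numbers. The function names and the layout of the index directory are made up for illustration.

```python
import os


def index_file(path, index_dir, digest):
    """Hard-link path into index_dir under its digest; return the link path."""
    os.makedirs(index_dir, exist_ok=True)
    link = os.path.join(index_dir, digest)
    if not os.path.exists(link):
        os.link(path, link)  # no data is copied, both names share one inode
    return link


def find_linked_paths(link, search_root):
    """Scan search_root for other names that point at the same inode as link."""
    target = os.stat(link)
    own = os.path.abspath(link)
    found = []
    for dirpath, _dirnames, filenames in os.walk(search_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.abspath(path) == own:
                continue  # skip the index entry itself
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable
            if (st.st_dev, st.st_ino) == (target.st_dev, target.st_ino):
                found.append(path)
    return sorted(found)
```

Because a rename within one filesystem keeps the inode, the scan still finds the file after it has been moved. It stops working if a tool replaces the file with a fresh copy instead of moving it.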

photocyte commented 1 year ago

Yes! I think you get the gist of what I was looking for.

A few comments on hardlinks below. Is it possible to instead store the file inode in a db (e.g. sqlite) & use that to look up the file? I agree, it seems best to rely on filesystem APIs where possible (although those will vary across filesystems).
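A minimal sketch of the inode-in-a-db idea, using Python's built-in `sqlite3`. The table layout and function names are assumptions for illustration only. Device and inode numbers are stored as text because some filesystems report inode values that do not fit SQLite's signed 64-bit integers. Resolving an inode back to a path still needs a directory scan here, since the inode alone does not tell you where the file lives.

```python
import os
import sqlite3


def open_index(db_path):
    """Open (or create) a small digest -> (device, inode) index."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "digest TEXT, dev TEXT, ino TEXT, "
        "PRIMARY KEY (digest, dev, ino))"
    )
    return db


def record(db, digest, path):
    """Remember which inode currently holds the content with this digest."""
    st = os.stat(path)
    db.execute(
        "INSERT OR IGNORE INTO files VALUES (?, ?, ?)",
        (digest, str(st.st_dev), str(st.st_ino)),
    )
    db.commit()


def resolve(db, digest, search_root):
    """Return current paths under search_root whose inode matches the digest."""
    rows = db.execute(
        "SELECT dev, ino FROM files WHERE digest = ?", (digest,)
    ).fetchall()
    wanted = set(rows)
    found = []
    if not wanted:
        return found
    for dirpath, _dirnames, filenames in os.walk(search_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable
            if (str(st.st_dev), str(st.st_ino)) in wanted:
                found.append(path)
    return sorted(found)
```

Two caveats with this approach: inode numbers can be reused after a file is deleted, and tools that save by writing a new file and renaming it over the old one change the inode, so a real implementation would probably want to re-verify the hash of whatever it finds.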

(1) Poor (if any?) support for hardlinks on cloud storage systems. Rclone doesn't mention them: https://rclone.org/overview/ . So if a cas directory were naively copied onto cloud storage, it would break.

(2) Hardlinks are also kind of a subtle thing even on local filesystems. With rsync, -a (archive mode) doesn't preserve them; you need the -H parameter to copy them correctly (recalling from memory, may not be exactly right). I'm not sure how the OS-native file copy dialogs deal with hardlinks, but I'd expect "not well".

(3) Unclear to me how common archive formats like .zip or .tar.gz handle hardlinks. They may or may not store & reproduce them w/ default settings, and that might vary across implementations and OSs.
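As a small data point for the concerns above, the sketch below checks two things for one specific implementation only (Python's standard library): whether a plain recursive copy with `shutil.copytree` keeps two hardlinked names linked, and whether `tarfile` records a hardlink as a link member when both names go into the same archive. The helper names are made up, and other tools (rsync, zip utilities, OS file managers, cloud sync clients) may well behave differently.

```python
import os
import shutil
import tarfile


def copy_keeps_links(src_dir, dst_dir, names):
    """Copy src_dir to dst_dir, then report whether names still share an inode."""
    shutil.copytree(src_dir, dst_dir)
    first = os.path.join(dst_dir, names[0])
    return all(
        os.path.samefile(first, os.path.join(dst_dir, name))
        for name in names[1:]
    )


def tar_link_members(src_dir, tar_path):
    """Archive src_dir as .tar.gz and list members stored as hardlinks."""
    # tar_path should be outside src_dir so the archive is not added to itself.
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(src_dir, arcname="data")
    with tarfile.open(tar_path, "r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.islnk()]
```

With `shutil.copytree`, each name is copied as an independent file, so the link relationship is lost and the data is duplicated. `tarfile` stores the second name it encounters for an inode as a link entry pointing at the first, so the relationship is at least recorded in the archive. Whether it is recreated on extraction depends on the extracting tool and the target filesystem.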

photocyte commented 1 year ago

Worth mentioning: the Recoll local search engine can index xattrs, so it might be possible to search for particular CAS-generated sha256 xattrs: https://www.lesbonscomptes.com/recoll/pages/index-recoll.html