dennwc / cas

Content Addressable Storage
Apache License 2.0

Support for "content-formatting-lenient" hashes? #4

Open photocyte opened 1 year ago

photocyte commented 1 year ago

Hashes like a sha256 of the binary file content are highly sensitive to changes in the file that may end up being practically inconsequential. An alternative strategy is to use hashes that are aware of the file's formatting structure, and only hash the important content while ignoring the formatting.

Here are some examples:

Would it be possible for CAS to support such content-formatting-lenient hashes?
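
To make the idea concrete, here is a minimal Go sketch of such a lenient hash, assuming a FASTA-style input (the normalization rules and function names are purely illustrative, not anything cas provides):

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"fmt"
	"os"
	"strings"
)

// lenientFASTAHash hashes only the record IDs and normalized sequences,
// so line wrapping, blank lines, letter case, and description text do
// not affect the digest. Purely illustrative, not part of cas.
func lenientFASTAHash(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // blank lines are formatting, not content
		}
		if strings.HasPrefix(line, ">") {
			// keep only the record ID, drop the free-text description
			id := line[1:]
			if fields := strings.Fields(id); len(fields) > 0 {
				id = fields[0]
			}
			fmt.Fprintf(h, "\n>%s\n", id)
			continue
		}
		// sequence lines: case and line wrapping are formatting
		h.Write([]byte(strings.ToUpper(line)))
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	sum, err := lenientFASTAHash("example.fasta") // hypothetical input file
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sum)
}
```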

dennwc commented 1 year ago

To be honest this defeats the purpose of content addressing.

However, it's possible to add a pipeline that extracts the structure/text from a file and hashes that as well, and then use those hashes to trace back to the source file.
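
As a rough sketch of what such a pipeline could look like (the `Extractor` and `LenientIndex` names here are hypothetical, not part of cas):

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// Extractor pulls the "meaningful" content out of a raw blob; one would
// exist per supported file format (hypothetical, not a cas interface).
type Extractor func(raw []byte) ([]byte, error)

// LenientIndex maps a hash of the extracted content back to the exact
// content hashes of the source blobs stored in the CAS.
type LenientIndex struct {
	bySemantic map[string][]string
}

func NewLenientIndex() *LenientIndex {
	return &LenientIndex{bySemantic: make(map[string][]string)}
}

// Add stores the mapping lenient-hash -> exact content hash for one blob.
func (ix *LenientIndex) Add(raw []byte, extract Extractor) error {
	contentHash := fmt.Sprintf("%x", sha256.Sum256(raw)) // exact, CAS-style address
	text, err := extract(raw)
	if err != nil {
		return err
	}
	lenientHash := fmt.Sprintf("%x", sha256.Sum256(text))
	ix.bySemantic[lenientHash] = append(ix.bySemantic[lenientHash], contentHash)
	return nil
}

// Sources returns the exact blobs whose extracted content hashes to lenientHash.
func (ix *LenientIndex) Sources(lenientHash string) []string {
	return ix.bySemantic[lenientHash]
}

func main() {
	ix := NewLenientIndex()
	// toy extractor: ignore everything up to and including the first newline
	extract := func(raw []byte) ([]byte, error) {
		if i := bytes.IndexByte(raw, '\n'); i >= 0 {
			return raw[i+1:], nil
		}
		return raw, nil
	}
	_ = ix.Add([]byte("formatting header v1\nreal content"), extract)
	_ = ix.Add([]byte("formatting header v2\nreal content"), extract)
	lenient := fmt.Sprintf("%x", sha256.Sum256([]byte("real content")))
	fmt.Println(ix.Sources(lenient)) // both source blobs, despite differing bytes
}
```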

photocyte commented 1 year ago

To be honest this defeats the purpose of content addressing.

I think it doesn't! It simply requires a broader definition of the "content" being addressed than purely the array of bytes on disk.

Here is some of the background motivation on the seqkit side of things, if interested: https://github.com/shenwei356/seqkit/issues/262

For .docx, of course, changing a single letter from Arial size 12 to Arial size 11.5 is totally indistinguishable to any human, and this would be an excellent use case for a "lenient" (or "robust", as I called it in the seqkit link) content hash. The hash of the binary content, on the other hand, is going to be totally different.

Admittedly, this is a bit tricky in that each file format would need its own "pipeline" to produce a lenient hash. The ideal case would be an external tool that doesn't just assume the content type from the file suffix, e.g. .docx, but uses the binary content itself to classify the content type. I haven't come across an easy-to-use open source library or tool to perform this content-based classification.
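
For illustration, a minimal sketch of content-based classification using byte sniffing (Go's standard library `net/http.DetectContentType` sniffs the first 512 bytes; the .docx refinement layered on top is my own illustration, not a complete solution):

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"net/http"
	"os"
)

// classify guesses the content type from the bytes, not the file name.
func classify(data []byte) string {
	mime := http.DetectContentType(data) // e.g. "application/zip", "text/plain; charset=utf-8"
	if mime != "application/zip" {
		return mime
	}
	// A .docx is a zip containing word/document.xml; peek inside to refine.
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return mime
	}
	for _, f := range zr.File {
		if f.Name == "word/document.xml" {
			return "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
		}
	}
	return mime
}

func main() {
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(classify(data))
}
```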

dennwc commented 1 year ago

I agree that it's worth exploring, but as you noted, each file type would need its own tool to extract the underlying data and/or structure.

In general CAS is mostly concerned with storing and syncing data. And for that, exact file content matters.

As for "lenient" hashing, for me it sounds like potentially an additional layer of indexing.

In CAS-based systems it's usually solved in a somewhat different way.

There are different kinds of "rolling" hashes, where the purpose is to detect some "interesting" patterns and cut the file into chunks based on them. So even though the hashes still change for a basic 1-byte edit, only a tiny portion of the file needs to be transferred for the sync. Perkeep uses rolling hashes, to give an example. In my case I decided to let the user decide on the file splitting: it can store either the full file as a single blob, or split it by block. A rolling hash is also on the TODO list.
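
For illustration, a toy sketch of content-defined chunking with a rolling hash (the constants and hash function are simplified stand-ins, not what Perkeep or cas actually use):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

const (
	window   = 48            // rolling window size in bytes
	base     = 1103515245    // multiplier for the polynomial rolling hash
	mask     = (1 << 13) - 1 // boundary pattern: ~8 KiB average chunks
	minChunk = 2 << 10
	maxChunk = 64 << 10
)

// pow is base^window (mod 2^32), used to drop the byte leaving the window.
var pow = func() uint32 {
	p := uint32(1)
	for i := 0; i < window; i++ {
		p *= base
	}
	return p
}()

// chunks cuts data at content-defined boundaries and returns each chunk's
// sha256 hex digest, which is what a CAS would store and address.
func chunks(data []byte) []string {
	var sums []string
	start := 0
	var hash uint32
	for i := 0; i < len(data); i++ {
		// Rabin-Karp style rolling hash over the last `window` bytes.
		hash = hash*base + uint32(data[i])
		if i >= window {
			hash -= uint32(data[i-window]) * pow
		}
		size := i - start + 1
		if (hash&mask == 0 && size >= minChunk) || size >= maxChunk || i == len(data)-1 {
			sum := sha256.Sum256(data[start : i+1])
			sums = append(sums, fmt.Sprintf("%x", sum))
			start = i + 1
		}
	}
	return sums
}

func main() {
	rnd := rand.New(rand.NewSource(1))
	a := make([]byte, 256<<10)
	rnd.Read(a)
	b := append([]byte(nil), a...)
	b[100_000] ^= 0xFF // a single-byte edit

	ca, cb := chunks(a), chunks(b)
	seen := make(map[string]bool, len(ca))
	for _, s := range ca {
		seen[s] = true
	}
	shared := 0
	for _, s := range cb {
		if seen[s] {
			shared++
		}
	}
	fmt.Printf("%d chunks, %d of %d unchanged after the edit\n", len(ca), shared, len(cb))
}
```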

Another approach is to write data to CAS directly in a structured way. If the tool knows that the file contains a list of smaller entries, it can index these entries separately, and thus reordering won't matter that much. This is what I'm experimenting with in CAS2. Another example is IPFS/IPLD.
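
A minimal sketch of why structured writes make reordering cheap (again illustrative only, not the CAS2 API): if each record is hashed separately and the file is identified by the sorted set of record hashes, reordering the records leaves that identity unchanged:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// recordSetHash hashes each record individually, sorts the digests, and
// hashes the sorted list, so the result is independent of record order.
func recordSetHash(records []string) string {
	sums := make([]string, 0, len(records))
	for _, r := range records {
		s := sha256.Sum256([]byte(r))
		sums = append(sums, fmt.Sprintf("%x", s))
	}
	sort.Strings(sums)
	top := sha256.Sum256([]byte(strings.Join(sums, "\n")))
	return fmt.Sprintf("%x", top)
}

func main() {
	a := []string{">seq1\nACGT", ">seq2\nTTGA"}
	b := []string{">seq2\nTTGA", ">seq1\nACGT"}  // same records, reordered
	fmt.Println(recordSetHash(a) == recordSetHash(b)) // true
}
```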

Maybe one of these approaches would be useful for your use case? I understand that full support of specific file format(s) would be ideal, but some of the approaches above may give a good-enough approximation. In the end it may be a matter of tooling: if you can easily check that 99% of the file is the same and/or only the order changed, maybe that will already answer your question(s)?