encoding integrity of source documents

clamsproject / mmif

MultiMedia Interchange Format

Apache License 2.0

encoding integrity of source documents #150

Open keighrim opened 3 years ago

keighrim commented 3 years ago

At some point, we advertised that MMIF would encode file checksums in Document objects for checking data integrity. I want to bring this up for discussion, specifically with regard to the following points:

I think data integrity is important, especially since MMIF, unlike LIF, doesn't carry the contents of the raw source data. What I'm not sure about is whether encoding a checksum hash string is the best way to ensure it.
If we encode it, we need a standardized way of computing it (e.g. CRC32), and it must be specified in the documentation.
Also, I think the implementation for generating the checksum string should go in the add_document method of the MMIF SDK (maybe behind an optional parameter). We could also consider implementing helpers, in either the MMIF SDK or the CLAMS SDK, that check file integrity using the checksum string.
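
To make the proposal concrete, here is a minimal sketch of what such helpers could look like, assuming CRC32 is the chosen standard and the checksum is stored as an 8-character lowercase hex string. The names `file_crc32` and `verify_file` are hypothetical and are not part of the MMIF SDK or the CLAMS SDK; only `zlib.crc32` is a real standard-library call.

```python
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC32 of a file as an 8-character lowercase hex string.

    Hypothetical helper: the name and the hex encoding are assumptions.
    """
    checksum = 0
    with open(path, "rb") as f:
        # Read in chunks so large media files are never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # zlib.crc32 accepts the running value as its second argument.
            checksum = zlib.crc32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")


def verify_file(path, expected):
    """Return True when the file's CRC32 matches the recorded hex string."""
    return file_crc32(path) == expected.lower()
```

An `add_document` implementation could call something like `file_crc32` when the optional parameter is set, and a consumer could call something like `verify_file` before reading the source file.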

angus-lherrou commented 3 years ago

I think this is a good idea. It does bring up some questions about the clams source command: the file paths we provide there are in-container paths, not host paths, so the CLI tool as is would not be able to generate those checksums itself. That would be an issue for the clams and mmif-python repos, though, not this one.

I think CRC32 makes sense for this.

keighrim commented 3 years ago

Good point. A simple solution I can imagine is to add a parameter to the clams source command that mends the file path on the fly (--prefix sounds like a proper name). We could also add a flag that makes clams source generate checksum strings while generating source MMIF JSON files.
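
One possible reading of "mending the file path" is sketched below, assuming the prefix is simply prepended to the path recorded in the MMIF so that the result points at a location readable on the host. The function `apply_prefix` is hypothetical; it does not exist in the clams CLI, and the actual semantics of a `--prefix` option would still need to be decided.

```python
from pathlib import PurePosixPath


def apply_prefix(document_path, prefix):
    """Prepend a prefix to a document path (hypothetical helper).

    Absolute paths are re-rooted under the prefix; relative paths are
    joined to it directly.
    """
    path = PurePosixPath(document_path)
    if path.is_absolute():
        # Drop the leading "/" so the join below does not discard the prefix.
        path = path.relative_to(path.anchor)
    return str(PurePosixPath(prefix) / path)


if __name__ == "__main__":
    print(apply_prefix("/data/video.mp4", "/mnt/host"))
```

With a mapping like this, the checksum could be computed from the prefixed path while the unmodified path is the one written into the source MMIF.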

angus-lherrou commented 3 years ago

As discussed in the meeting today, Python's zlib module has a CRC32 implementation. However, it also has zlib.adler32, for which the docs state, "An Adler-32 checksum is almost as reliable as a CRC32 but can be computed much more quickly."

I don't know what "much" means here, but it might be worth considering Adler-32 as our standard instead.
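
One way to put a number on "much" is to time both functions on the same buffer. The sketch below is only a rough measurement harness; `compare_checksums` is a hypothetical name, and the result depends heavily on the platform and on how the underlying zlib library was built, so it should be run on the machines we actually care about.

```python
import os
import timeit
import zlib


def compare_checksums(data, repeats=5):
    """Return the best-of-N wall time in seconds for crc32 and adler32."""
    results = {}
    for name, func in (("crc32", zlib.crc32), ("adler32", zlib.adler32)):
        # Take the minimum over several runs to reduce scheduling noise.
        timings = timeit.repeat(lambda: func(data), number=1, repeat=repeats)
        results[name] = min(timings)
    return results


if __name__ == "__main__":
    # 16 MiB of random bytes stands in for a chunk of a media file.
    sample = os.urandom(16 * 1024 * 1024)
    for name, seconds in compare_checksums(sample).items():
        print(f"{name}: {seconds * 1000:.2f} ms")
```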

keighrim commented 2 months ago

Recent developments:

we might want to use a hash function that matches near-identical assets based on their contents, in addition to a strict hash over the byte stream (e.g., https://pypi.org/project/videohash/)
that said, we might want to allow multiple hashes, with their "specs" stored in the MMIF serialization
the primary purpose of these hash records is not security, so cryptographic strength isn't our first consideration (e.g., https://xxhash.com/)
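
As a sketch of how multiple hashes with their "specs" might be recorded, the example below stores each hash as an algorithm name plus a value and verifies only the algorithms it knows. The record layout (`algorithm`, `value`) and the function names are assumptions for illustration, not part of the MMIF specification. It uses only standard-library byte-stream hashes; a content-based hash or xxhash could in principle be registered the same way.

```python
import hashlib
import zlib

# Registry of byte-stream hashers, keyed by the name stored in each record.
HASHERS = {
    "crc32": lambda data: format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
    "adler32": lambda data: format(zlib.adler32(data) & 0xFFFFFFFF, "08x"),
    "sha256": lambda data: hashlib.sha256(data).hexdigest(),
}


def make_hash_records(data, algorithms):
    """Build one {"algorithm", "value"} record per requested algorithm."""
    return [{"algorithm": name, "value": HASHERS[name](data)} for name in algorithms]


def verify_hash_records(data, records):
    """Map each algorithm to True or False, or None when it is not registered."""
    outcome = {}
    for record in records:
        hasher = HASHERS.get(record["algorithm"])
        # Unknown specs are reported rather than treated as failures.
        outcome[record["algorithm"]] = (
            None if hasher is None else hasher(data) == record["value"]
        )
    return outcome
```

Returning `None` for unregistered algorithms lets a consumer that lacks, say, a video-specific hasher still check the records it does understand.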