Open keighrim opened 3 years ago
I think this is a good idea. It does bring up some questions about the `clams source` command: the file paths we provide there are in-container paths, not host paths, so the CLI tool as is would not be able to generate those checksums itself. That would be an issue for the clams and mmif-python repos, though, not this one.
I think CRC32 makes sense for this.
Good point. A simple solution I can imagine is to add a parameter to the `clams source` command that mends the file path on the fly (`--prefix` sounds like a proper name). We can also add a flag that makes `clams source` generate checksum strings while generating source MMIF JSONs.
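To make the proposal concrete, here is a rough sketch of one possible reading of it. It assumes the user passes a path that is readable on the host, and that the prefix is the in-container directory to be recorded in the MMIF. The function name `describe_source`, the `host_root` argument, and the `location`/`checksum` keys are all hypothetical; none of this is existing `clams source` behavior.

```python
import zlib
from pathlib import Path, PurePosixPath


def describe_source(relative_path: str, prefix: str, host_root: str = ".") -> dict:
    """Hypothetical helper: checksum a host file, record a prefixed path.

    relative_path: file path relative to host_root (readable on the host).
    prefix: in-container directory to prepend to the recorded location.
    """
    host_file = Path(host_root) / relative_path
    # Reads the whole file at once for brevity; large media files
    # would be better served by reading in chunks.
    checksum = zlib.crc32(host_file.read_bytes()) & 0xFFFFFFFF
    return {
        # The path as the containerized app would see it.
        "location": str(PurePosixPath(prefix) / relative_path),
        # CRC32 rendered as 8 lowercase hex digits (an assumed format).
        "checksum": f"{checksum:08x}",
    }
```

With this split, the checksum is computed where the file is actually readable, while the MMIF still carries the in-container path.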
As discussed in the meeting today, Python's `zlib` module has a CRC32 implementation. However, it also has `zlib.adler32`, for which the docs state, "An Adler-32 checksum is almost as reliable as a CRC32 but can be computed much more quickly."
I don't know what "much" means here but it might be worth considering choosing Adler-32 as our standard instead.
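For anyone who wants to measure what "much" means on their own data, a minimal sketch along these lines could work. `stream_checksum` and `time_algorithms` are made-up names for illustration; only `zlib.crc32` and `zlib.adler32` come from the standard library. Results will depend on the machine and the zlib build, so no numbers are claimed here.

```python
import time
import zlib


def stream_checksum(path, algorithm="crc32", chunk_size=1 << 20):
    """Compute a running CRC32 or Adler-32 over a file, chunk by chunk."""
    update = {"crc32": zlib.crc32, "adler32": zlib.adler32}[algorithm]
    # zlib's documented starting values: 0 for CRC32, 1 for Adler-32.
    value = 1 if algorithm == "adler32" else 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = update(chunk, value)
    return value & 0xFFFFFFFF


def time_algorithms(data: bytes) -> dict:
    """Return wall-clock seconds each algorithm takes over the same bytes."""
    timings = {}
    for name, func in (("crc32", zlib.crc32), ("adler32", zlib.adler32)):
        start = time.perf_counter()
        func(data)
        timings[name] = time.perf_counter() - start
    return timings
```

Running `time_algorithms` on a few representative media files would tell us whether the speed difference matters at our file sizes.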
Recent developments;
At some point, we had advertised that MMIF would encode file checksums in the `Document` objects for checking data integrity. I want to bring it to the discussion, specifically related to these questions;
`add_document` method of the MMIF SDK (maybe as an optional parameter). We could also consider implementing some helpers in either the MMIF SDK or the CLAMS SDK to check file integrity using the checksum string.
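As a starting point for the helper idea, an integrity check could look roughly like the sketch below. It assumes CRC32 as the algorithm and an 8-digit hex string as the stored format, neither of which has been decided in this thread. `verify_checksum` is a hypothetical name, not an existing MMIF SDK or CLAMS SDK function.

```python
import zlib


def verify_checksum(path: str, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Hypothetical helper: True if the file's CRC32 matches `expected`.

    expected: CRC32 as an 8-digit hex string (assumed storage format).
    """
    crc = 0
    with open(path, "rb") as f:
        # Read in chunks so large media files never sit fully in memory.
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}" == expected.lower()
```

An app could call something like this on each document's file before processing, and fail early when the file it sees is not the one the MMIF was generated from.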