petersilva opened 5 years ago
Having two separate fields complicates things. When a file is received, the integrity sum is persisted using an extended attribute; with two fields we would have to persist the sum, the signature, or both, and read one or the other back, and if both are present, is there a precedence? For now, I implemented a change in the name of the field: it is called "integrity" instead of "sum", with the intent of allowing signature algorithms in addition to the simple checksums used now:
"integrity" = { "method" : "md5" , "value": "the checksum value" }
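For concreteness, a minimal sketch of building such a field in Python. The hex encoding of the digest is my assumption here; the spec would still have to pin down the value encoding:

```python
import hashlib

def integrity_field(path, method="md5"):
    # Checksum the whole file in chunks so large files don't
    # have to fit in memory.
    h = hashlib.new(method)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    # Hex digest is an assumed encoding, not a settled choice.
    return {"integrity": {"method": method, "value": h.hexdigest()}}
```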
What do people think?
I guess we agreed on this some time ago, but perhaps @petersilva could bring here the recent discussion about unusual algorithms, like "arbitrary". I am quite OK with things like "FLK-SHA512" (hash of the concatenation of the first and last kilobyte of the file, used for large files to avoid reading the whole thing). It means a bit of work for me, but makes perfect sense. Would we also need other size variants, e.g. "FLM-.." (first and last megabyte)? Or perhaps "FL4K-..."? It becomes a bit cryptic, but I could live with it.
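A sketch of how FLK-SHA512 might work, under the definition above. The behaviour for files shorter than two kilobytes is my guess (hash the whole file); that edge case would need to be specified:

```python
import hashlib
import os

def flk_sha512(path, k=1024):
    # Hypothetical FLK-SHA512: SHA-512 over the concatenation of the
    # first and last k bytes, avoiding a full read of large files.
    size = os.path.getsize(path)
    h = hashlib.sha512()
    with open(path, "rb") as f:
        if size <= 2 * k:
            # Assumed edge case: short files are hashed in full.
            h.update(f.read())
        else:
            h.update(f.read(k))       # first kilobyte
            f.seek(size - k)
            h.update(f.read(k))       # last kilobyte
    return h.hexdigest()
```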
this comment mostly just reports what is in the Canadian implementations currently. All of the currently implemented ones are in use, and were added because of use cases encountered:
https://github.com/MetPX/sarracenia/blob/master/doc/sr_postv3.7.rst#sum-method-value describes the currently implemented ones: `"method" : "md5" | "sha512" | "md5name" | "link" | "remove" | "cod" | "random" | "arbitrary"`
We can drop the md5-based ones because people object to using an old hash (even if it serves the purpose just fine... not going to argue). So:
sha512 involves using that algorithm to checksum the entire file. The obvious best choice, but reading the entire file is costly when they are big.
md5name just uses the file name. We could rename it to "name" and omit the value, as it would simply be extracted from relpath. That would be more compact, and not tied to an algorithm. This is used in the RADAR production case, where the filenames produced by multiple production chains are equivalent, but the files are not bitwise identical.
link identifies that the post is for a symbolic link. Since the symlink already has a separate link field carrying the target of the link, this can likely be omitted entirely; perhaps clarified by #10. This is used in the HPC mirroring use case.
remove - designates a file that is to be removed. Since the post is often done after the file is gone, we don't know what its checksum was, so the value would use the name algorithm. Perhaps clarified by #10. Also used in the HPC mirroring use case.
cod -- calculate on download. To save the initial poster from reading the file, the first recipient that downloads the file (and is therefore already reading it) calculates the checksum as the download occurs. The "value" is the integrity method to use for the calculation. Often used when polling a remote (non-AMQP pub/sub) site, where checksums aren't available until after the files are downloaded.
"random" -- only useful in debugging/testing: generate a random value for a checksum.
"arbitrary" -- the value was determined by an algorithm unknown to the transport layer. It is thus an opaque value that must be preserved for comparison should future versions of the same file appear. This is done in Sarracenia using extended attributes on Linux, and Alternate Data Streams on Windows. Required for some data sources that do not provide a public algorithm.
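Since "arbitrary" values must survive alongside the file, here is a rough sketch of the extended-attribute persistence on Linux. The attribute name is made up for illustration (Sarracenia's actual name may differ), and Windows would use an Alternate Data Stream instead:

```python
import os

XATTR_NAME = "user.integrity"  # hypothetical attribute name

def persist_integrity(path, value):
    # Store the opaque value as a Linux extended attribute.
    # Returns False where xattrs are unsupported (other platforms
    # would need a different mechanism, e.g. ADS on Windows).
    try:
        os.setxattr(path, XATTR_NAME, value.encode())
        return True
    except (OSError, AttributeError):
        return False

def recall_integrity(path):
    # Read the persisted value back; None if absent or unsupported.
    try:
        return os.getxattr(path, XATTR_NAME).decode()
    except (OSError, AttributeError):
        return None
```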
Not present:
The FLK-SHA512 one is as you described it: not yet implemented, but I am thinking about it, as in one use case I need a compromise between no data checksum (such as name) and a full data checksum (sha512).
thoughts:
An idea for a general scheme people might be able to use: SHA512 with some kind of grammar for expressing the subset of the data to hash, e.g. M50-SHA512 (middle 50% of the file?).
if we allow compression in the content field, I guess the checksum should apply to the uncompressed raw content, as that is easier to compare to the file on disk. so to validate checksum vs. content, one would have to decode the base64, unzip, and then read it a third time to validate the checksum.
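To make the decode/unzip/verify sequence concrete, a sketch assuming gzip compression and illustrative field names (none of this layout is settled; it only shows the checksum applying to the uncompressed raw bytes):

```python
import base64
import gzip
import hashlib

def verify_inline_content(message):
    # Recover the raw bytes: base64-decode, then decompress.
    raw = gzip.decompress(base64.b64decode(message["content"]["value"]))
    # The checksum covers the uncompressed content, so it can be
    # compared directly against the file on disk as well.
    method = message["integrity"]["method"]
    digest = hashlib.new(method, raw).hexdigest()
    return digest == message["integrity"]["value"]
```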
Competing/complementary/nested goals for the sum field:
1. uniquifier ... something to identify a file as the same as, or different from, corresponding versions of itself.
2. checksum ... something to confirm that the product actually received was not corrupted.
3. signature ... something to confirm that the product was produced by someone possessing a certain key.
All the mesh algorithm needs is 1. These purposes encompass one another: 3 does strictly more than 2, and 2 more than 1. The same ordering holds for size: a proper signature is going to be a lot more bytes than just a checksum, and 2 will in turn be, say, 512 bits, a lot more than a typical UUID. We could use separate data structures for all three, but it is tempting to somehow combine them.
An example of data that is logically identical but differs bitwise: in North America there is GOES DCS (Data Communications Service), a low-bandwidth uplink for automated stations. Various organizations/sites operate LRGS (land readout ground stations) to pick up DCS data from a local satellite dish. Often there is a tail on the actual datum giving information about signal strength and noise; obviously such data will differ for every dish. People posting such data could make it site-neutral and binary-identical by stripping off the radio metadata, but then people who want that information would miss it. So ideally, a checksum that excluded the tail would be used.
Something that is constant: an intermediary party does not know enough about the data to select an appropriate sum algorithm. The choice needs to be made by the source.
> if we allow compression in the content field, I guess the checksum should apply to the uncompressed raw content, as that is easier to compare to the file on disk. so to validate checksum vs. content, one would have to decode the base64, unzip, and then read it a third time to validate the checksum.
I agree - the checksum must be independent of the actual transfer encoding/compression. Verifying the checksum makes sense only in the systems that are going to use the data, and those will do the unpacking anyway. Moreover, the "content" field is used only for small data.
The ET-CTS committee found the sum="," notation idiosyncratic. One option:
"sum" = { "method" : "md5" , "value": "the checksum value" }
They also raised the slightly different notion of a signature that can accomplish the same thing as a checksum, while also confirming provenance. The suggestion is:
"signature" = { "method": "???" , "value": "the signature value" }
Currently, sum is a required field, but the proposal is to have one of sum or signature required.
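To illustrate the shape the proposed "signature" field might take, here is a sketch using HMAC-SHA256, chosen only because it is in the Python standard library. An actual deployment would presumably use an asymmetric scheme so that receivers can verify provenance without holding the signing key; the method name and field layout are placeholders:

```python
import hashlib
import hmac

def make_signature_field(payload, key):
    # Sign the payload bytes; "hmac-sha256" is a placeholder method
    # name, not a value from the spec.
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"signature": {"method": "hmac-sha256", "value": sig}}

def verify_signature_field(payload, key, field):
    # Constant-time comparison to avoid leaking the expected value.
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, field["signature"]["value"])
```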