MetPX / wmo_mesh

minimal sample to demonstrate a mesh network with a pub/sub message-passing protocol (MQTT in this case).
GNU General Public License v2.0

Make Sum fields clearer. #2

Open petersilva opened 5 years ago

petersilva commented 5 years ago

The ET-CTS committee found the comma-separated sum="<method>,<value>" notation idiosyncratic. One option:

"sum" = { "method" : "md5" , "value": "the checksum value" }

They also raised the slightly different notion of a signature that can accomplish the same thing as a checksum, while also confirming provenance. The suggestion is:

"signature" = { "method": "???" , "value": "the signature value" }

Currently, sum is a required field, but the proposal is to have one of sum or signature required.
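For illustration only, here is a minimal sketch of how a receiver could enforce the "one of sum or signature required" rule and verify a checksum of the proposed shape. The function name is invented, and signature verification is deliberately left out since no algorithm has been chosen:

```python
import hashlib

def verify_message(msg: dict, payload: bytes) -> bool:
    """Sketch: accept a message only if it carries a verifiable 'sum' or a 'signature'."""
    checksum = msg.get("sum")
    signature = msg.get("signature")
    if checksum is None and signature is None:
        raise ValueError("one of 'sum' or 'signature' is required")

    if checksum is not None:
        method = checksum["method"]                      # e.g. "md5", "sha512"
        if method not in hashlib.algorithms_available:
            raise ValueError(f"unknown checksum method: {method}")
        return hashlib.new(method, payload).hexdigest() == checksum["value"]

    # Signature verification needs key material and an agreed algorithm ("???"),
    # so it is out of scope for this sketch.
    return True
```

A real receiver would also have to decide on a precedence when both fields are present, which is exactly the complication raised in the next comment.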

petersilva commented 5 years ago

So having two separate fields complicates things... when the file is received, the integrity sum gets persisted using an extended attribute, so now we either persist sum or signature or both, and we read one or the other; if both are present, is there a precedence? For now, I implemented a change in the name of the field: it is called "integrity" instead of "sum", and the intent is to allow signature algorithms in addition to the simple checksums used now:

"integrity" : { "method" : "md5" , "value": "the checksum value" }
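A rough sketch of the extended-attribute persistence described above (illustrative only: the attribute name user.sr_integrity and the JSON encoding are assumptions, and os.setxattr/os.getxattr are Linux-specific):

```python
import json
import os

XATTR_NAME = "user.sr_integrity"   # hypothetical attribute name

def persist_integrity(path: str, integrity: dict) -> None:
    """Store the message's integrity field on the received file."""
    os.setxattr(path, XATTR_NAME, json.dumps(integrity).encode("utf-8"))

def read_integrity(path: str):
    """Read it back, or return None if the file carries no such attribute."""
    try:
        return json.loads(os.getxattr(path, XATTR_NAME))
    except OSError:
        return None
```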

What do people think?

josusky commented 4 years ago

I guess we agreed on this here some time ago, but perhaps @petersilva could bring in the recent discussion about unusual algorithms, like "arbitrary". I am quite OK with things like "FLK-SHA512" (a hash of the concatenation of the first and last kilobyte of the file, used for large files to avoid reading the whole thing). It means a bit of work for me but makes perfect sense. Would we also need other size variants, e.g. "FLM-..." (first and last megabyte)? Or perhaps "FL4K-..."? It becomes a bit cryptic, but I could live with it.
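A minimal sketch of what an FLK-SHA512 style computation could look like (not an agreed specification: the block size, the fallback for files shorter than two kilobytes, and the function name are all assumptions):

```python
import hashlib
import os

def flk_sha512(path: str, block: int = 1024) -> str:
    """SHA-512 over the concatenation of the first and last `block` bytes of a file."""
    size = os.path.getsize(path)
    h = hashlib.sha512()
    with open(path, "rb") as f:
        if size <= 2 * block:
            h.update(f.read())               # small file: hash everything (assumed fallback)
        else:
            h.update(f.read(block))          # first kilobyte
            f.seek(-block, os.SEEK_END)
            h.update(f.read(block))          # last kilobyte
    return h.hexdigest()
```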

petersilva commented 4 years ago

This comment mostly just reports what is in the Canadian implementations currently. All of the currently implemented methods are in use, and each was added because of a use case encountered:

https://github.com/MetPX/sarracenia/blob/master/doc/sr_postv3.7.rst#sum-method-value describes the currently implemented ones:

`"method" : "md5" | "sha512" | "md5name" | "link" | "remove" | "cod" | "random" | "arbitrary"`

Not present:

The FLK-SHA512 one is as you described it: not yet implemented, but I am thinking about it, because in one use case I need a compromise between no data checksum (such as md5name) and a full data checksum (sha512).
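For illustration, a sketch of how the data-hashing subset of these methods could be dispatched. The non-hash methods ("link", "remove", "cod", "random", "arbitrary") carry other semantics and are omitted; the table and function below are assumptions, not the sarracenia implementation:

```python
import hashlib
import os

def _file_digest(path: str, algorithm: str) -> str:
    """Hash a file's contents in chunks with the named hashlib algorithm."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Assumed dispatch table; "md5name" is taken to hash the file name, not its contents.
METHODS = {
    "md5":     lambda path: _file_digest(path, "md5"),
    "sha512":  lambda path: _file_digest(path, "sha512"),
    "md5name": lambda path: hashlib.md5(os.path.basename(path).encode()).hexdigest(),
}

def compute_sum(method: str, path: str) -> str:
    if method not in METHODS:
        raise ValueError(f"unsupported or non-hash method: {method}")
    return METHODS[method](path)
```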

petersilva commented 4 years ago

thoughts:

petersilva commented 4 years ago

Competing/complementary/nested goals for the sum field:

  1. uniquifier ... something to identify a file as the same as, or different from, corresponding versions of itself.

  2. checksum ... something to confirm that the product actually received was not corrupted.

  3. signature ... something to confirm that the product was produced by someone possessing a certain key.

All the mesh algorithm needs is 1. These purposes encompass one another: 3 does strictly more than 2, and 2 does more than 1. It also follows that, in bytes, a proper signature is going to be a lot larger than just a checksum, and 2 will in turn be, say, 512 bits, a lot more than a typical UUID. We could use separate data structures for all three, but it is tempting to somehow combine them.

petersilva commented 4 years ago

An example of identical data that differs: in North America, there is GOES DCS (Data Collection System), a low-bandwidth uplink for automated stations. Various organizations/sites operate an LRGS (Local Readout Ground Station) to pick up DCS data from a local satellite dish. Often there is a tail on the actual datum that gives information about signal strength and noise. Obviously such data is going to differ for every dish. People posting such data could make it site-neutral and binary-identical if they stripped off the radio metadata, but then people who want that information would miss it. So ideally, a checksum that excludes that tail would be used.
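As a sketch of the idea only: checksum the datum while excluding a trailing metadata block. How the tail boundary is located is format-specific, so here its length is simply passed in:

```python
import hashlib

def sum_excluding_tail(data: bytes, tail_len: int) -> str:
    """SHA-512 over the datum with the last `tail_len` bytes (assumed to be the
    site-specific radio metadata) excluded."""
    body = data[:-tail_len] if tail_len > 0 else data
    return hashlib.sha512(body).hexdigest()
```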

petersilva commented 4 years ago

Something that is constant: an intermediary party does not know enough about the data to select an appropriate sum algorithm. The choice needs to be made by the source.

josusky commented 4 years ago
  • if we allow compression in the content field, I guess the checksum should apply to the uncompressed raw content, as that is easier to compare to the file on disk. So to validate the checksum against the content, one would have to decode the base64, unzip, and then read it a third time to validate the checksum.

I agree - the checksum must be independent of the actual transfer encoding/compression. Verification of the checksum makes sense only in the systems that are going to use the data, and those will do the unpacking anyway. Moreover, the "content" field is used only for small data.
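To make that sequence concrete, a minimal sketch of validating a checksum against an embedded content field, assuming base64 encoding, gzip compression, and the field layout shown earlier in this thread (all of which are assumptions here):

```python
import base64
import gzip
import hashlib

def validate_embedded_content(msg: dict) -> bool:
    """Decode the base64, decompress, then checksum the raw content and compare."""
    raw = gzip.decompress(base64.b64decode(msg["content"]))      # assumed field name
    expected = msg["sum"]                                        # {"method": ..., "value": ...}
    return hashlib.new(expected["method"], raw).hexdigest() == expected["value"]
```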