FDSN / miniSEED3-TechnicalEvaluation

Discussion and evaluation of miniSEED 3
5 stars 1 forks source link

Requirement: Include a CRC (cyclic redundancy check) of the complete record #12

Open krischer opened 6 years ago

krischer commented 6 years ago

Include a CRC (cyclic redundancy check) of the complete record.

chad-earthscope commented 6 years ago

This requirement was discussed in change proposal 6 to the 2016 strawman. The Quaterra CRC algorithm was offered for use in change proposal 25 to the 2016 strawman.

In the previous format specifications from IRIS I suggested the adoption of the CRC-32C (Castagnoli) algorithm for the following reasons:

andres-h commented 6 years ago

I think the type of CRC is at this stage an irrelevant implementation detail. The question is rather if there should be a CRC of complete record, partial record or no record-level CRC at all.

I'm in favor of partial record CRC.

krischer commented 6 years ago

The type of CRC should probably also be discussed as this stage as stage 3 of the design hopefully only deals with complete implementations and the type of CRC is a detail that could be sorted out by then. Thus we have to things to discuss here: Which type of checksum calculation and what parts should be checksummed.

Excerpt from @andres-h link:

All proposals add some sort of record checksum. The problem with #1 and #2 is that adding anything to the record (for example a blockette or CBOR object describing the quality control procedure that was made at the data centre), even increasing the data publication version, just to indicate that the data passed quality control procedures, invalidates the checksum. Due to a hardware or software glitch, it can happen that data was corrupted, but the (new) checksum is correct.

The concept of the current proposal (#3) is that data should be modified as little as possible. In particular, adding "quality control passed" should not invalidate the primary checksum if no other changes were made to the record. Multiple checksums and hashes are supported (CRC-32, MD5, SHA, etc.)

I personally disagree with this and think that everything should be checksummed - its cheap and also serves as a safety mechanism when touching any part of a record to some extent. It could also be two separate 16bit checksums - one for the data, one for the header. The risk of false positives might still be small enough to have to worry about it.

andres-h commented 6 years ago

I agree that everything should be checksummed, but having a single checksum does not match with the requirement (?) of being able to modify the records in the datacentre (adding QC, etc.). As a user, I want to check if the CRC matches the one that was generated by the digitizer or not. Also, I want to know which changes, if any, were done in the datacentre. Assuming that digitizers support NGF directly, similar to mseed2/seedlink (requirement?).

crotwell commented 6 years ago

I feel the checksum should be over the whole record and modification of a record should force a checksum recalculation. The purpose of this checksum is simply to detect corruption of data during transmission or storage, not to provide "providence" of the data back to the digitizer. While that is an important concept, it is a much larger problem and to do it correctly it needs to be done separately from the timeseries format.

For this purpose I think simpler is better, and so the CRC-32 makes most sense.

chad-earthscope commented 6 years ago

I agree with @crotwell. The most value of adding a CRC is to check whether the record has been corrupted during transmission or storage. Even if multiple CRCs could be used in a provenance scheme, requiring a reader to calculate multiple CRCs to do the "has this record been corrupted" check is not justified.

krischer commented 6 years ago

Summary

(Please let me know if I missed a point or misunderstood something)

There is agreement that we want CRC but not what algorithm or what "type" of CRC. Technically it is also clear that the CRC field must be set to some pre-determined value (or ignored) for the actual calculation of the CRC. Thus please vote on:

  1. What should the CRC include? (complete record / partial record / no record-level CRC)
  2. What algorithms should be used? (CRC-32/other suggestion)
crotwell commented 6 years ago

1 complete record 2 crc-32c

chad-earthscope commented 6 years ago
  1. Complete record with the CRC field set to 0's.
  2. CRC-32C
kaestli commented 6 years ago
  1. complete record
  2. crc-32
ozym commented 6 years ago
  1. complete record as per @chad-iris
  2. crc-32
claudiodsf commented 6 years ago
  1. What should the CRC include? (complete record / partial record / no record-level CRC)

Complete record

  1. What algorithms should be used? (CRC-32/other suggestion)

CRC-32 or any other lightweight algorithm

ihenson-bsl commented 6 years ago

1 complete record 2 crc-32c

ValleeMartin commented 6 years ago
  1. Complete record
  2. CRC-32 (or similar lightweight algorithm)
JoseAntonioJara commented 6 years ago
  1. Complete record
  2. CRC-32