Requirement: Identification of non-raw, derived data

krischer commented 6 years ago

Allow for identification of non-raw, derived data (e.g. processed data, quality parameters, metadata versioning, synthetic data).

chad-earthscope commented 6 years ago

We support this and are increasingly seeing the need to clearly identify synthetic, processed and derived data.

This requirement seems like a sub-bullet to #4, which is a relatively large sub-topic in its own right.

krischer commented 6 years ago

Any ideas what this could look like? A free-form ASCII string after the identifier proposed in #4?

crotwell commented 6 years ago

Especially for the derived data, we should be able to identify the channels that the new timeseries came from. For example, some record the latency of an input channel at a receiving node as a new timeseries. Another case, where there would be more than one "derived from" channel, would be deriving a North channel from a borehole instrument with non-traditional orientations.

A standard "derived from" key could be done as part of the optional/additional headers. This does somewhat mix metadata into timeseries data, but for items as simple as latency or rotations it might be acceptable, and as far as I know StationXML does not have the ability to specify this type of derivation.

I would argue that unless the processing or derivation is trivial or close to it, it is better not to mix the determination of the codes of a new channel, an identification problem, with linking to the source channels, a metadata problem. This is especially true if the fundamental nature of the data changes, e.g. the latency of a ground motion channel.
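As a rough sketch of the "derived from" key, assuming the optional headers end up being something JSON-like (which has not been decided here), the borehole rotation case might look like this. The key name `derived_from`, the `method` field, and the channel identifiers are made up for illustration only.

```python
import json


def make_derived_header(source_ids, method):
    """Build a hypothetical optional header listing the source channels."""
    if not source_ids:
        raise ValueError("a derived channel needs at least one source channel")
    # "derived_from" and "method" are illustrative key names, not part of any spec.
    return {"derived_from": list(source_ids), "method": method}


# Hypothetical identifiers: a North channel rotated from two borehole components.
header = make_derived_header(["XX.BORE.00.BH1", "XX.BORE.00.BH2"], "rotation")
encoded = json.dumps(header, sort_keys=True)
print(encoded)
```

Anything beyond a short list of source channels and a one-word method would probably belong in the metadata instead.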

jmsaurel commented 6 years ago

It looks a little like the data quality flag of miniSEED 2.4 (R, D, Q or M) but with extended capabilities, doesn't it?

I'm in favor of something that allows synthetic channels or derived channels (i.e., samples whose values from the digitizer have been modified) to be clearly identified. Maybe an extended version of the data quality flag.

I'm not in favor of placing information in the data about where this new data comes from. This should be kept in the metadata.

Regarding the indication of quality verifications that don't affect the sample values at all (i.e., only qualifying or removing bad data), this could be handled by the versioning in #13.

krischer commented 6 years ago

A simplistic possibility would be to somehow enhance the quality codes and add two new codes for synthetic and derived data (are there other broad categories?) and then delegate further details to the arbitrary headers of #14 as proposed by @crotwell.
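To make that concrete, here is a minimal sketch of what an extended quality code could look like. The four existing letters are the miniSEED 2.4 data quality indicators; the letters `S` and `X` for synthetic and derived data, and the helper names, are purely hypothetical and only chosen for illustration.

```python
from enum import Enum


class QualityCode(Enum):
    # Existing miniSEED 2.4 data quality indicators.
    INDETERMINATE = "D"
    RAW = "R"
    QUALITY_CONTROLLED = "Q"
    MODIFIED = "M"
    # Hypothetical additions; the letters are placeholders, not a proposal.
    SYNTHETIC = "S"
    DERIVED = "X"


def parse_quality(char):
    """Map a single-character code to a QualityCode, rejecting unknown letters."""
    try:
        return QualityCode(char)
    except ValueError:
        raise ValueError(f"unknown quality code: {char!r}") from None


def needs_extra_headers(code):
    """True for the new categories whose details would live in the extra headers."""
    return code in (QualityCode.SYNTHETIC, QualityCode.DERIVED)


print(parse_quality("R"), needs_extra_headers(parse_quality("S")))
```

The flag itself stays a single character; everything else (source channels, processing steps) would go into the arbitrary headers or the metadata.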

andres-h commented 6 years ago

Would BHZ be a "derived channel", since it is derived from HHZ?

jmsaurel commented 6 years ago

If BHZ comes directly out of the digitizer, I wouldn't call it a "derived channel", because you don't know how it's made inside. It could be derived from the HHZ, but it could come from a different filter stream.

But if BHZ is made by the acquisition software (such as SC3, for example), then it could be called a "derived channel" because it is no longer data that comes straight out of the digitizer box.

tim-iris commented 6 years ago

Isn't this really an issue where we are implying that we must capture provenance? If so, and I think it is, then I do not think this really belongs in the time series exchange format. Provenance is a much bigger issue and could unnecessarily complicate things. Any expansion of the quality code should be thought through very carefully. I have concerns with this.

krischer commented 6 years ago

Summary

(Please let me know if I missed a point or misunderstood something)

This is a bit of a complicated issue. I think we agree that full and proper provenance is not in the scope of the next generation data format and must be delegated to the metadata in some form. Also, where exactly this information should go in the format is not clear, and there are a large number of possibilities. Thus, please vote on the following issue:

Should there be a simple way to flag time series in the new format as either "raw" (whatever the exact definition of that is), "derived" (not "raw"), or "synthetic" (not based on actual recordings)? (Yes/No)

crotwell commented 6 years ago

Yes

chad-earthscope commented 6 years ago

Yes.