Open krischer opened 6 years ago
As one idea described in a draft of new identifiers, posted to #4, the channel code could be enhanced in the following ways:
First - Band code, identifying the general sample rate and response band of the instrument
Second - First instrumentation code, identifying the family to which the sensor belongs
Third - Not always present. Second instrument code, extended sensor identification
Last - Orientation code, indicating the directionality of the sensor measurement
The first code in the sequence is always the band code, the second is always the first instrumentation code and the last is always the orientation code.
The 3-character channel codes are equivalent to the SEED 2.4 channels, as documented in the SEED manual and the IRIS FDSN Identifiers draft of 2018-1-3. Essentially, this scheme adds an optional character to define further instruments. These additional "instruments" could include derived time series, etc.
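The positional rule above (band first, orientation last, one or two instrument characters in between) can be sketched in a few lines of Python. This is only an illustration of the draft idea, not an agreed specification; the `ChannelCode` type and `split_channel` function are hypothetical names, and the 4-character code `YGDZ` is an illustrative value.

```python
from typing import NamedTuple


class ChannelCode(NamedTuple):
    band: str
    instrument: str  # one character (SEED 2.4 style) or two (draft idea)
    orientation: str


def split_channel(code: str) -> ChannelCode:
    """Split a 3- or 4-character channel code purely by position."""
    if len(code) not in (3, 4):
        raise ValueError(f"expected 3 or 4 characters, got {code!r}")
    # The first character is always the band code and the last is always
    # the orientation code; whatever lies between is the instrument code.
    return ChannelCode(band=code[0], instrument=code[1:-1], orientation=code[-1])


print(split_channel("BHZ"))
print(split_channel("YGDZ"))
```

Because the split depends only on position and length, a 3-character code is handled exactly as it is in SEED 2.4.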
The scientific users of SEED data are relatively comfortable with the nomenclature of channel naming. I think it is very important to recognize that any change to any identifiers, but especially the channel code given how much information is packed into it, is potentially disruptive to users of the data.
Proposal by @tim-iris (https://github.com/FDSN/miniSEED3-TechnicalEvaluation/issues/4#issuecomment-359118116):
I think there should be some discussion related to the Channel field since it is really trying to specify three different attributes of a channel in a single field. Would it make sense to break out the current three fields separately into BandCode, Instrument code, and orientation. It would give greater flexibility than keeping them together as one. Users could still specify things such as BHZ but the interfaces would map those into B_H_Z for instance for query processing. Users might not be impacted but data generators could have greater flexibility and capability.
Regarding the proposal from @chad-iris above:
One change I would like to see is to move the identifier for synthetic data (currently an X in the instrument code in SEED 2.4) to the band code. The band limit/sampling rate of synthetic data is also not that meaningful in many cases. Then the instrument code can be used to specify the type of data (synthetic rotational data, synthetic strains, ...).
Currently 18 of 36 possible (26 letters + 10 digits) are allocated, so there are characters available to identify a synthetic or even derived time series in the "band" code.
So you are thinking something like "XHZ" for a synthetic, broadband seismic trace? That would even be backwards compatible with mseed2. The same could be done for derived data, e.g. "YGDZ", where the "Y" means derived, the "GD" is for geodetic (if we adopt 2-character instrument codes) and the "Z" is orientation. This seems OK. It definitely distorts the meaning of the band code, but may be worth it.
One user-level issue is that selecting, for example, "*HZ" to get broadband seismic data, would match synthetic data too.
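The wildcard concern is easy to reproduce with shell-style matching. The sketch below uses Python's standard `fnmatch` module; the channel list is made up for illustration, and `XHZ` is the hypothetical synthetic channel discussed above.

```python
from fnmatch import fnmatchcase

# Illustrative channel list; "XHZ" stands for a synthetic trace.
channels = ["BHZ", "LHZ", "XHZ", "BHN"]

# A typical request pattern also picks up the synthetic channel.
selected = [c for c in channels if fnmatchcase(c, "*HZ")]
print(selected)

# Users would have to exclude the synthetic band code explicitly.
observed = [c for c in channels if fnmatchcase(c, "[!X]HZ")]
print(observed)
```

Whether a given request service supports a negated character class such as `[!X]` is a separate question; the point is only that existing `*HZ` selections would change meaning.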
The channel code currently seems to have more information packed into it than the other 3 (net, sta, loc) and as such maybe it should expand to 8 chars as well. Perhaps with some limitation that 3 or 4 char codes be restricted to follow the FDSN channel naming convention, but 5-8 length codes, maybe combined with the X, Y, Z additional band codes above, are open for synthetic/derived/processed data uses?
Also allow the dash char like in the station and loc codes?
There is value in having station, location and channel codes all be the same length and based on the same character set.
We could also adopt the convention that as soon as there are dashes in the "channel" code then it's either derived or synthetic data. There is also value in having more than one value for the orientation - generic derivatives of the wavefields for example would require two orientation components.
If we modify/expand the channel code to better identify derived or synthetic data while retaining information about the instrument (like XHZ or X_BHZ), then an expansion of the quality flag #10 to indicate synthetic/derived data is no longer necessary.
One might as well leave the 3-character channel naming convention as it is now in order to maintain compatibility with miniSEED. Either the current band code or the current instrumentation code could be used to flag synthetic data without introducing conflicts with the current channel naming.
The flagging of derived or synthetic data only requires 1 bit each. Too little to justify an incompatible change to the current channel naming.
On a side note, the miniSEED data quality indicator may also be used to flag both derived or synthetic data. There would be no conflict with the currently defined quality indicators.
The primary reason for expanding the channel code is that (almost) all of the instrument codes have been used up. As a revision like this only happens after decades, leaving the channel code at 3 chars would be a mistake I believe.
I prefer the longer 8 char limit, but we should do at least 4 chars to allow additional instrument codes. So much the better if we can also handle synthetic/derived/processed within the channel code as well.
The current SEED 2.4 specification lists as instrument codes:
Seismometers: H, L, G, M, N
Tilt Meter: A
Creep Meter: B
Calibration Input: C
Pressure: D
Electronic Test Point: E
Magnetometer: F
Humidity: I
Rotational Sensor: J
Temperature: K
Water Current: O
Geophone: P
Electric Potential: Q
Rainfall: R
Linear Strain: S
Tide: T
Bolometer: U
Volumetric Strain: V
Wind: W
Derived or Generated Channel: X
Non-specific Instruments: Y
Synthesized Beams: Z
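To see how much room is left, the list above can be checked mechanically. The sketch below transcribes the codes exactly as quoted in this thread; whether digits would be permitted as instrument codes is an assumption made only for the sake of the count.

```python
import string

# Instrument codes as quoted above from SEED 2.4.
INSTRUMENT_CODES = {
    "H": "Seismometer", "L": "Seismometer", "G": "Seismometer",
    "M": "Seismometer", "N": "Seismometer",
    "A": "Tilt Meter", "B": "Creep Meter", "C": "Calibration Input",
    "D": "Pressure", "E": "Electronic Test Point", "F": "Magnetometer",
    "I": "Humidity", "J": "Rotational Sensor", "K": "Temperature",
    "O": "Water Current", "P": "Geophone", "Q": "Electric Potential",
    "R": "Rainfall", "S": "Linear Strain", "T": "Tide", "U": "Bolometer",
    "V": "Volumetric Strain", "W": "Wind",
    "X": "Derived or Generated Channel",
    "Y": "Non-specific Instruments", "Z": "Synthesized Beams",
}

unused_letters = sorted(set(string.ascii_uppercase) - set(INSTRUMENT_CODES))
# Assumption: digits could in principle be used as instrument codes too.
unused_digits = sorted(set(string.digits) - set(INSTRUMENT_CODES))

print(len(INSTRUMENT_CODES), unused_letters, len(unused_digits))
```

With this list every letter is taken, so any new single-character instrument code would have to be a digit or reuse an existing letter.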
That's a lot and I agree that there are not many left. But it's also a huge coverage of instrument types. What additional instrument types are on the agenda to be added to the scheme?
Given the disruption of changing the net, station and loc codes, why not take the opportunity to give more room to the channel code? Rather than asking what additional types are needed, the question really should be: are we positive that no new types will be needed over the lifetime of the format? If there is even a chance of new instrument types, then the expansion is worth it, I feel.
As was suggested for network and station, maybe we could make room for new sensor or instrument types, but keep the current codes for any new data corresponding to the existing instruments.
The only problem I see with the current list is the identification of derived or synthesized data. Because they concern an instrument, we lose the information as to which data they correspond to. For example, according to SEED 2.4, a vertical stream at 1 Hz derived from a tilt-meter would be LXZ. A vertical stream at 1 Hz derived from a broadband sensor would also be LXZ. That's confusing.
I think either an extension of the quality code #10 or of the channel code is needed, at least for that. For the first solution, we would have, for example LAZ.{R,D,Q,M} for the tiltmeter raw data and LAZ.X for the tiltmeter derived data, LHZ.{R,D,Q,M} for the seismometer and LHZ.X for the seismometer derived data. For the second, we would have, LAZ for the tiltmeter raw data and X_LAZ for the tiltmeter derived data, LHZ for the seismometer and X_LHZ for the seismometer derived data. The drawback of the second solution is that it breaks compatibility with miniSEED.
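The ambiguity and the two proposed remedies can be put side by side. This is a hedged sketch: the three helper functions are hypothetical names, and the `.X` quality suffix and `X_` prefix are just the notations used in this comment, not defined formats.

```python
def seed24_derived(channel: str) -> str:
    """SEED 2.4 convention: the instrument code is replaced by 'X'."""
    return channel[0] + "X" + channel[2]


def quality_flag_derived(channel: str) -> str:
    """First option: keep the channel and mark it with an 'X' quality code."""
    return f"{channel}.X"


def prefixed_derived(channel: str) -> str:
    """Second option: keep the channel and add an 'X_' prefix."""
    return f"X_{channel}"


for raw in ("LAZ", "LHZ"):
    print(raw, seed24_derived(raw), quality_flag_derived(raw), prefixed_derived(raw))
```

Under the SEED 2.4 convention both inputs collapse to `LXZ`, while either proposed option keeps the tiltmeter and seismometer streams distinguishable.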
why not take the opportunity to give more room to the channel code
Agreed, but giving more room doesn't necessarily require a change in the channel naming at this stage.
Or, if you are really keen to change the channel naming, why not take the opportunity to split the band, instrument and orientation codes into truly separate 1-or-more-character fields? I hate the idea of having to follow rules like "if the channel code consists of more than 4 characters and the second character is a dash followed by a digit then...." etc. If more instrument codes are needed than available now, why not allocate 2 characters in a dedicated instrument code field?
I agree with Philip's observation. Also, the channel naming is currently just an FDSN convention that does not preclude other channel naming schemes. At some point we should consider whether conventions should become a stronger part of the format. From a user's perspective this would be useful.
Cheers
Tim Ahern
Director of Data Services IRIS
On Jan 25, 2018, at 6:36 AM, Philip Crotwell notifications@github.com wrote:
The primary reason for expanding the channel code is that (almost) all of the instrument codes have been used up. As a revision like this only happens after decades, leaving the channel code at 3 chars would be a mistake I believe.
I prefer the longer 8 char limit, but we should do at least 4 chars to allow additional instrument codes. So much the better if we can also handle synthetic/derived/processed within the channel code as well.
The only problem I see with the current list is the identification of derived or synthesized data. Because they concern an instrument, we lose the information as to which data they correspond to. For example, according to SEED 2.4, a vertical stream at 1 Hz derived from a tilt-meter would be LXZ. A vertical stream at 1 Hz derived from a broadband sensor would also be LXZ. That's confusing.
I think either an extension of the quality code #10 or of the channel code is needed, at least for that. For the first solution, we would have, for example LAZ.{R,D,Q,M} for the tiltmeter raw data and LAZ.X for the tiltmeter derived data, LHZ.{R,D,Q,M} for the seismometer and LHZ.X for the seismometer derived data. For the second, we would have, LAZ for the tiltmeter raw data and X_LAZ for the tiltmeter derived data, LHZ for the seismometer and X_LHZ for the seismometer derived data. The drawback of the second solution is that it breaks compatibility with miniSEED.
The LHZ.{R,D,Q,M} and LHZ.X scheme does indeed seem pretty nice when generating derived data and also synthetic data based on real measurements.
But the concept of the observational band code does in my experience not translate directly to synthetic data, as it is often highly oversampled and its valid frequency band also strongly depends on the numerical method used and the chosen physical approximations.
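I also like @jsaul's proposal to split the channel code into multiple fields (if we break compatibility in any case I think this would be feasible):
Band code (optional as not always sensible)
Data quality/Data type code (observed [RDQM], derived, synthetic, ...)
Instrument code
Orientation Code
I furthermore propose to make each field at least two chars long for future proofing.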
While the concept of channel codes has some illustrative and historical justification, it is conceptually broken already now and will, in the form of some combination of predefined letters, not make the step to the broader scope of the next generation data format.
As soon as you have synthetic data, you would like to differentiate types of synthetisation mechanisms, and as soon as you have processed data, you would like to know how it was processed.
If you are as serious about wind as about soil shaking, you would like to differentiate sensors measuring spoon wheel rotation from sensors measuring differential pressure and sensors measuring chill (as we differentiate H, L, G, M, N in seismometers); the same goes for other types of data streams.
This will not be covered in a classification scheme with final enumeration of allowed classes (mapping to individual letters), and it will not be done without parameters (e.g. describing sensor orientation also if it is not N, E, or Z)
So I strongly believe this can only be adequately solved in the metadata (next generation station XML). I would suggest to leave channel identifiers as they are. For the traditional FDSN interests and use cases, they may be good enough, and other communities will anyway not rely on the SNCL-Part of the FDSN identifier, but use their own identification schemes.
I recommend to leave FDSN channel codes unchanged.
@kaestli:
So I strongly believe this can only be adequately solved in the metadata (next generation station XML).
You cannot rely on the current channel codes for anything but highest level classification, instead the metadata must be consulted to know what the instrument actually is, sampling rate, location, etc., etc. So, we are already in an ecosystem where the metadata must be consulted.
If I understand your argument to be that since we cannot fully describe an instrument in the identifier we should not have any description of the instrument or extend the existing model, then I disagree. It is extremely useful to have high-level classification of types in the identifiers for many practical purposes such as searching/browsing for data of a certain class or type. They are even useful to limit the search of metadata for a more fine-grained selection.
As @jsaul writes above, there is already a huge coverage of instrument types supported. A relatively small enhancement could cover more and future top-level types. The only two I have heard raised are synthetic and derived data; I expect more will come in time.
I also like @jsaul's proposal to split the channel code into multiple fields (if we break compatibility in any case I think this would be feasible):
Band code (optional as not always sensible)
Data quality/Data type code (observed [RDQM], derived, synthetic, ...)
Instrument code
Orientation Code
A small aside: I'm not in favor of dropping much information, but data quality codes are an exception. They really should be retired or mapped to version and/or retained in an optional field. They are misnomers (they mean very little regarding quality) and are generally unimportant to users who just want the best/latest. Overloading them with derived or synthetic designators does not feel right, those feel more like an instrument, i.e. a generator of data.
Otherwise, I agree that splitting the channel sub-codes into separate fields would provide easier expansion and clarity. As suggested by @tim-iris they could be separated using the same separators as the other fields. For illustration, putting it all together (minus the "quality") we get something like this:
FDSN:<network>_<station>_<location>_<band>_<instrument>_<orientation>
where we define the channel as <band>_<instrument>_<orientation>, but that's just a matter of documentation; the identifiers would look like the above. Alternatively, the channel could be sub-divided using a different delimiter, such as the . (dot) suggested above:
FDSN:<network>_<station>_<location>_<band>.<instrument>.<orientation>
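Both layouts can be split with the same small routine. The following is a sketch under stated assumptions, not a reference implementation: the `SourceId` type and `parse_source_id` function are hypothetical, the station values are illustrative, and the code assumes that `_` never occurs inside an individual code.

```python
from typing import NamedTuple


class SourceId(NamedTuple):
    network: str
    station: str
    location: str
    band: str
    instrument: str
    orientation: str

    @property
    def channel(self) -> str:
        """Traditional channel code, rebuilt from its three sub-codes."""
        return self.band + self.instrument + self.orientation


def parse_source_id(sid: str, channel_delimiter: str = "_") -> SourceId:
    """Parse 'FDSN:<net>_<sta>_<loc>_<band><d><instrument><d><orientation>'."""
    namespace, _, rest = sid.partition(":")
    if namespace != "FDSN":
        raise ValueError(f"unsupported namespace in {sid!r}")
    # Split off network, station and location; the remainder is the channel.
    parts = rest.split("_", 3)
    if len(parts) != 4:
        raise ValueError(f"expected four top-level fields in {sid!r}")
    network, station, location, channel = parts
    sub_codes = channel.split(channel_delimiter)
    if len(sub_codes) != 3:
        raise ValueError(f"expected band, instrument, orientation in {sid!r}")
    return SourceId(network, station, location, *sub_codes)


print(parse_source_id("FDSN:IU_ANMO_00_B_H_Z").channel)
print(parse_source_id("FDSN:IU_ANMO__B.H.Z", channel_delimiter=".").location == "")
```

Keeping a `channel` property means software built around network, station, location and channel can still be served from the split fields.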
If we abandon the concept of channel we should be mindful that the vast majority of current FDSN-format handling APIs (and software) deal with network, station, location and channel. All of those APIs are incompatible with selection based on different fields, which is a major wrinkle. Assuming we go with URIs for a new format, I assume we would, in time, evolve specifications and systems that allow selection based on URI. Until that is more prevalent, any changes to the 4-code identification system have significant impact beyond the time series data format.
Chad Trabant wrote on 27.01.2018 01:57:
I also like @jsaul's proposal to split the channel code into multiple fields (if we break compatibility in any case I think this would be feasible):
Band code (optional as not always sensible)
Data quality/Data type code (observed [RDQM], derived, synthetic, ...)
Instrument code
Orientation Code
A small aside: I'm not in favor of dropping much information, but data quality codes are an exception. They really should be retired or mapped to version and/or retained in an optional field.
Come on, it's just one byte!... While at the same time there have been proposals to extend each of the band, instrument, orientation codes to more than one byte to make them future proof.
I agree that the quality code is not the strongest component of miniSEED. But it's there and the range of its values is far from exhausted. Therefore I believe that extending the scope from "quality of data" to "kind of data" would be appropriate to also include special kinds of data like synthetics or derived data without having to abuse any of the band, instrument or orientation code.
Honestly, I want to be able to produce 1-Hz, high-gain, vertical component synthetics with instrument response of an STS-1 and call them LHZ. Because it simply makes sense.
They are misnomers (they mean very little regarding quality) and are generally unimportant to users who just want the best/latest. Overloading them with derived or synthetic designators does not feel right, those feel more like an instrument, i.e. a generator of data.
But it exists already, even in miniSEED! There are plenty of possibilities to improve use of that code and even miniSEED users would benefit because extending the definition of that code doesn't have to break compatibility with the current standard and usage.
FDSN:<network>_<station>_<location>_<band>.<instrument>.<orientation>
If we abandon the concept of channel we should be mindful that the vast majority of current FDSN-format handling APIs (and software) deal with network, station, location and channel. All of those APIs are incompatible with selection based on different fields, which is a major wrinkle.
Since you were quoting my proposal I would like to clarify that I never meant to abandon the concept of channel as it is now. In the course of the discussion about the format there were proposals to use special characters to separate fields, which would lead to a channel code that would have to be interpreted. A channel code where the position of e.g. the orientation code depends on the presence and value of other channel code fields. IMHO this would be much worse than the simple, 3-character notation that we have now, where we can rely on the orientation code being the last of exactly three characters. If there needs to be more space e.g. for the instrument code then a fixed two (or more) bytes should be allocated for the instrument code. But first please come up with a few good examples of instruments that are not covered by the current instrument codes.
But either way I think that a deviation from the current, well known and widely adopted 3-character notation would be too high a price to pay to solve a problem that doesn't even exist. Keep it as it is. It works so well.
Honestly, I want to be able to produce 1-Hz, high-gain, vertical component synthetics with instrument response of an STS-1 and call them LHZ. Because it simply makes sense.
"LHZ" means the data are from a high-gain seismometer, there is no indication that it may be synthetic. Regarding using the "quality code/record indicator" to indicate that kind of meta-typing: while I agree with the advantage of backwards miniSEED 2.x compatibility, it is problematic in that it 1) is rarely something users see (data centers would need to expose it a lot more) and 2) there is no connection with metadata. So, real clarity and usability issues.
On the other hand, a channel like "LXHZ" (or "L.XH.Z", however that ends up) could be recognized as a synthetic of a high-gain seismometer with much more clarity and fewer side effects for request formats, data summaries, data referencing, etc.
@chad-iris, I understand your concerns about the connection with metadata and the risk that users might not look at the quality code/record indicator and thus think this is a regular measured value. Thinking about this, maybe one issue is that too much meaning has been packed into the instrument code.
When you look at most of the instrument codes, they could be associated with the physical quantity measured and not necessarily with the instrument that produces the values. Take for example the D code for pressure: you can use it for weather atmospheric pressure, precision micro-barometer, infrasound, sound, water pressure, vacuum pressure inside a sensor .... There is absolutely no indication of the instrument. Having H, L, M, N, G and P for ground displacement, velocity and acceleration depending on the instrument is one of the only examples of its kind.
On the other hand, a channel like "LXHZ" (or "L.XH.Z", however that ends up) could be recognized as a synthetic of a high-gain seismometer with much more clarity and fewer side effects for request formats, data summaries, data referencing, etc.
It could be a nice solution, with the only disadvantage being that it would be limited to NGF.
Honestly, I want to be able to produce 1-Hz, high-gain, vertical component synthetics with instrument response of an STS-1 and call them LHZ. Because it simply makes sense.
"LHZ" means the data are from a high-gain seismometer, there is no indication that it may be synthetic.
That's why I support the proposal to re-define and use the 'quality' indicator for that, extending its scope to mean 'kind of data' rather than just 'quality of data'.
LHZ does not imply that the data were recorded with a seismometer. It only refers to the instrument response that was applied to the raw ground motion to turn it into the seismogram. The same can of course be done with a synthetic seismogram by filtering it with the instrument response LHZ refers to. Why not?
Regarding using the "quality code/record indicator" to indicate that kind of meta-typing: while I agree with the advantage of backwards miniSEED 2.x compatibility, it is problematic in that it 1) is rarely something users see (data centers would need to expose it a lot more)
This is true but it is also true that a lot of 'behind the scenes' support actually exists already: SDS, BUD, slarchive, etc. The fact that it hasn't been used very much up to now is due to the fact that it hasn't been used consistently. But we are talking here about improving things.
and 2) there is no connection with metadata. So, real clarity and usability issues.
The connection with metadata is via the channel code, see above. This is of course limited to instrument responses. But there isn't any more that you can do with the 'X' as part of the channel code because there are many different ways of how you could generate synthetic "LXHZ" data. How would you distinguish between reflectivity and WKBJ synthetics based on the 'X'? Surely a detailed description of how the synthetics were generated should at some point become part of the metadata but I don't see how the miniSEED incompatible 'X' would address this better than the compatible quality code approach. Neither of them can and to achieve it one would have to turn the whole channel naming upside-down.
One thing we may need to distinguish between is synthetic/derived data from a data center and that produced locally by an individual researcher. For data created by an individual for their own use, they can more or less do whatever they want. That is true now with miniseed and will be true in the future with NGF, there is nothing the FDSN can do about it. So calling a synthetic LHZ, if that makes them happy, is their own business. Data from a data center for distribution needs more care, and that is where some FDSN guidelines would be most useful.
Another issue is the overlap in our current identifier system between "identification" and "authorization". When data is distributed under a sncl, there is a tacit assumption that data labeled with station code XYZ came from or was created by, or at least was authorized by, the people that operate station XYZ. Once you go too far down the synthetic/derived/processed data route that is no longer true. In other words the sncl changes from meaning "data FROM station XYZ" to "data ABOUT station XYZ" and the distinction is important as distribution of synthetics and derived data becomes more prevalent, and especially when a synthetic might come from a different data center than the original data, meaning that even finding the corresponding metadata becomes harder. A simple bit to indicate "synthetic" really doesn't give enough information to even begin to link the data and metadata.
Yet another issue is that the current sncl system is used as a namespace. So I know that I can safely add channel BHZ to my station CO.XYZ because I control the CO network code, and so as long as I am careful, my station, location and channel codes are globally unique. As soon as seismologist A starts creating synthetics for my station, they are inside of my namespace, and the global uniqueness is lost. Even worse if seismologists B, C and D do the same and we have channel name collisions with no hope that our sncl identifier actually identifies anything.
Moreover, there is the assumption that if a datacenter distributes timeseries data, there exists corresponding metadata for the sncl, and I do not believe stationxml is really capable of describing a synthetic, processed or derived channel at present, except in very generic placeholder terms like with a unity response and a comment. For synthetic data with the same sncl as the original data, the metadata does NOT correspond.
I don't have answers to these issues, but think that the channel identifying scheme, however it changes (or doesn't change) from the existing sncl, should keep these things in mind. My opinion is that a simple sncl, while fine for raw recorded data, is not sufficient to address these needs, even with an expanded number of characters and a data quality byte. At a minimum there probably needs to be some higher level namespace, probably above network, to differentiate data center A's synthetic seismogram from seismologist B's derived/processed seismogram from the network operator's original timeseries. We need to capture both the notions that "this timeseries was created by" along with "this timeseries is related to but not the same as that timeseries".
The good news is that a URL type identifier probably provides enough flexibility to address these needs. We just have to figure out how.
Whew!
Once you go too far down the synthetic/derived/processed data route that is no longer true. In other words the sncl changes from meaning "data FROM station XYZ" to "data ABOUT station XYZ" and the distinction is important as distribution of synthetics and derived data becomes more prevalent, and especially when a synthetic might come from a different data center than the original data, meaning that even finding the corresponding metadata becomes harder. A simple bit to indicate "synthetic" really doesn't give enough information to even begin to link the data and metadata.
I agree that finding the corresponding meta-data is hard but #10 would at least enable the distinction between "FROM station X" and "ABOUT station X". Tools could automatically set these flags any time some modified data is stored.
I think that uniquely identifying things that do not originate from data centers is hard (might be possible with full URI style identifiers but which authority is supposed to resolve these?) and potentially not in the scope of the FDSN identifiers. Thus just flagging data that is not from a data center would be a definite improvement to the status quo.
Some shameless self-advertisement and potentially something to draw inspiration from - we worked on this a while ago: http://seismicdata.github.io/SEIS-PROV/index.html It's basically a generic provenance description of seismic data. Association happens the other way around in the sense that the data stores the identifier of its provenance entity. This would be possible with custom headers in NGF.
LHZ does not imply that the data were recorded with a seismometer. It only refers to the instrument response that was applied to the raw ground motion to turn it into the seismogram. The same can of course be done with a synthetic seismogram by filtering it with the instrument response LHZ refers to. Why not?
It could be done but I've only very rarely seen this - mostly people deconvolve the response from the data and then compare observations with synthetics in some physical units.
When you look at most of the instrument codes, they could be associated with the physical quantity measured and not necessarily with the instrument that produces the values. Take for example the D code for pressure: you can use it for weather atmospheric pressure, precision micro-barometer, infrasound, sound, water pressure, vacuum pressure inside a sensor .... There is absolutely no indication of the instrument. Having H, L, M, N, G and P for ground displacement, velocity and acceleration depending on the instrument is one of the only examples of its kind.
This would indeed be nice and I would support this.
I think in general we have to be careful not to assign too much meaning to the identifiers - this is what the meta-data is for after all. But the current system of "meaningful" FDSN identifiers (in some sense limited meta-information) has proven to be incredibly useful and is widely accepted. Most people in our community can just look at the channel codes and already pretty accurately tell what kind of data they represent. A lot of tools and scripts are also based on the semantics of these codes. So I feel that regardless of what is decided in the end, this has to be retained in some fashion.
If we conclude that the community can swallow actual change I gather from this discussion that we would like to distinguish five different characteristics (assuming that net/sta/loc basically are spatial designators):
These could all be separate identifiers. This would make the full FDSN identifiers a bit more complex but also a lot more expressive and thus it might be a reasonable compromise.
This conversation makes me go back to thinking what the format was really intended to do.
Original SEED was to act as an exchange format for observed ground motion data.
The utility of using SEED to carry derived data such as synthetics seemed compelling and we could do it within the existing naming conventions with some trouble but it kind of works. But the method we used to get synthetics into miniSeed has known limitations, again with the number of variations it can support. No need to go into details.
I think the NGF miniSEED will have utility for many different domains, and none of us on this thread even knows how they choose to identify their equivalent of a time series identifier. We shouldn't pretend we can come up with a general solution that works for everything.
So instead of trying to come up with an infinite variety of schemes down at the network, station, channel, location level, why not do something that extends the usability to any field?
I think we should focus on the namespace capability. FDSN: should be reserved for "primarily" ground motion data or other instruments we are familiar with, such as BH, LH, EN, etc. It can also be used, as it has been, with a variety of other instruments for which we have data.
For synthetic data generated within the FDSN, why not create a new namespace such as FDSNSYN:? Many of the issues being discussed go away. Similarly, if one wants to process original time series data, they could create an FDSNPROC: namespace.
If the geodetic community wants to use NGF for Geodetic data they could use the GEODETIC: namespace. I think moving in this direction eliminates the need to get too worried about some of the complications people were starting to identify. If a data center sees a namespace they do not understand they simply ignore it in their archiving or request processing or figure out what they want to do. Moving the issues into a new namespace approach allows us to get back to the bigger FDSN issue.
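As a rough sketch of how a data center might act on the namespace prefix, the snippet below keeps identifiers from namespaces it understands and skips the rest. Everything after the colon is an illustrative placeholder, and `KNOWN_NAMESPACES`, `namespace_of` and `accept` are hypothetical names.

```python
# Namespaces this hypothetical data center has chosen to handle.
KNOWN_NAMESPACES = frozenset({"FDSN", "FDSNSYN", "FDSNPROC"})


def namespace_of(identifier: str) -> str:
    """Return the part of the identifier before the first colon."""
    namespace, sep, _ = identifier.partition(":")
    if not sep:
        raise ValueError(f"no namespace in {identifier!r}")
    return namespace


def accept(identifier: str, known: frozenset = KNOWN_NAMESPACES) -> bool:
    """Decide whether to archive or serve data with this identifier."""
    return namespace_of(identifier) in known


ids = [
    "FDSN:CO_XYZ__B_H_Z",     # observed data
    "FDSNSYN:CO_XYZ__B_H_Z",  # synthetic for the same station
    "GEODETIC:SITE1",         # unknown to this data center
]
kept = [i for i in ids if accept(i)]
print(kept)
```

The observed and synthetic series share the same network, station and channel codes here yet remain distinct identifiers, which is the collision problem the namespace idea is meant to avoid.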
One other comment about channels. The comment about the Pressure (D) instrument code not really working is a bit more complicated. When the D code is invoked, since pressure has no direction, the orientation code was grabbed in order to distinguish different kinds of pressure instruments. This alone shows the need for expanding the channel code. Joining the instrument code and the orientation code was pragmatic but was one of those messy things we had to do long ago.
Assuming the FDSN identifiers will be used in the new data format (please discuss this in #4), how should the channel code be expanded (or not) and what conventions should be adopted (if any)?