Requirement: A new method of identifying a time series

krischer commented 6 years ago

A new method of identifying a time series is required for NGF: It should be adequate to meet the need to deploy multiple sensors, retain semantic meaning where possible and support a significant increase in the number of sensors deployed as a single project.

Suggestion: Use a variable length time series identifier allowing significantly more flexibility in the way individual elements can be identified.
Suggestion: Create an FDSN name space allowing straightforward adoption by other communities in a simpler and more decoupled manner.

chad-earthscope commented 6 years ago

I am strongly in favor of defining a single time series identifier for two main reasons:

We would only need a single, likely variable length, field with an "identifier" in the format. A (large) maximum length would be the only limiting factor defined by the low level format. Then discussions about how the identifiers are constructed and how codes can be expanded are independent of the low level format. We could, of course, have variable length fields for each identifier, but that adds complexity that is unnecessary in my opinion.
Such an identifier provides a huge degree of flexibility for currently needed and future changes. We will be able to expand, for example, channel codes without a limit imposed by the low level format. Furthermore, if we incorporate a "namespace" into the identifier, we allow for the possibility of adopting evolved or completely different identification schemes in the future.

While this would set the stage for having a completely new identifier scheme, we have a large volume of miniSEED 2.x data that uses the old identifiers and must be accommodated. To this end, IRIS proposed a solution during the previous technical discussion, which has been modified slightly since then, to define an identifier constructed as a Uniform Resource Name (URN) with the following pattern:

FDSN:<network>_<station>_<location>_<channel>

where the network, station and channel codes are required to be non-empty and the location code may be empty. The 3 underscore (ASCII 95) delimiters must always be present.

Example identifiers:
FDSN:IU_COLA_00_BHZ (where network=IU, station=COLA, location=00 and channel=BHZ)
FDSN:NL_HGN__LHZ (where network=NL, station=HGN, location is empty and channel=LHZ)

The "FDSN:" namespace would identify this combination of SEED codes. Alternate schemes could be defined or adopted in the future.
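For illustration, a minimal Python sketch of building and splitting an identifier that follows the pattern above. The helper names `build_fdsn_id` and `parse_fdsn_id` are hypothetical and not part of the proposal or any library, and the sketch assumes that the individual codes never contain an underscore.

```python
FDSN_PREFIX = "FDSN:"  # namespace prefix from the proposal


def build_fdsn_id(network, station, location, channel):
    """Join the four SEED codes into one identifier (hypothetical helper)."""
    if not (network and station and channel):
        raise ValueError("network, station and channel must be non-empty")
    return f"{FDSN_PREFIX}{network}_{station}_{location}_{channel}"


def parse_fdsn_id(identifier):
    """Split an identifier into (network, station, location, channel)."""
    if not identifier.startswith(FDSN_PREFIX):
        raise ValueError("not in the FDSN: namespace")
    # Assumes codes contain no underscore, so 3 delimiters give 4 parts.
    parts = identifier[len(FDSN_PREFIX):].split("_")
    if len(parts) != 4:
        raise ValueError("expected exactly 3 underscore delimiters")
    network, station, location, channel = parts
    if not (network and station and channel):
        raise ValueError("network, station and channel must be non-empty")
    return network, station, location, channel


print(parse_fdsn_id("FDSN:IU_COLA_00_BHZ"))
print(parse_fdsn_id("FDSN:NL_HGN__LHZ"))  # empty location code
```

The empty location in the second identifier shows why the three delimiters must always be present: without them the remaining codes could not be assigned to their fields.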

For reference the current working draft of this proposal is attached: FDSN Identifiers - 2018-1-3.pdf

There are a few areas that are known to still need work:

Rules/convention for location codes. This is not fully defined in the working draft and needs further discussion.
The draft includes our concept of 3 or 4 character channel codes where an extra character could be used to identify more instrument types. It is unclear whether this concept can accommodate all needed instruments.

crotwell commented 6 years ago

I am also in favor of the single identifier. It has many advantages, but also a couple disadvantages worth noting.

An advantage of a single identifier is that what is likely the most common operation on miniseed data, matching records with channels, becomes a single string comparison instead of the current 4. The null-byte truncation or '~' search step is also reduced from 4 to 1, perhaps making extraction quicker.

But a single identifier will use additional bytes. The IRIS proposal for example has effectively 8 extra bytes compared with existing miniseed2. Also, extraction of the network or station code becomes a more expensive string splitting operation. I think the tradeoff is acceptable, but the disadvantages are worth noting.
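To make the tradeoff concrete, here is a small Python sketch. The record dictionaries and the `sid` key are made up for illustration and do not correspond to any real miniSEED library.

```python
# Hypothetical in-memory records; the "sid" key is an assumption.
records = [
    {"sid": "FDSN:IU_COLA_00_BHZ", "nsamples": 400},
    {"sid": "FDSN:NL_HGN__LHZ", "nsamples": 120},
    {"sid": "FDSN:IU_COLA_00_BHZ", "nsamples": 380},
]

wanted = "FDSN:IU_COLA_00_BHZ"

# Matching records to a channel: one string comparison per record.
matches = [r for r in records if r["sid"] == wanted]


def network_of(sid):
    """Extract the network code; this needs splitting, not a fixed offset."""
    codes = sid.split(":", 1)[1]   # drop the namespace
    return codes.split("_", 1)[0]  # first code is the network


print(len(matches), network_of(wanted))
```

Matching stays a plain equality test, while `network_of` has to scan for delimiters, which is the extra cost mentioned above.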

Instead of always storing "FDSN:", we could perhaps define that a stored identifier starting with ':' implies 'FDSN:', which would save 4 bytes. All other, non-FDSN namespaces would have to be fully specified.
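A sketch of how a reader and writer might handle that shorthand, assuming a stored identifier that begins with ':' stands for the 'FDSN:' namespace and everything else is stored in full. The names `expand_stored_id` and `compact_id` are illustrative only.

```python
DEFAULT_NAMESPACE = "FDSN"


def expand_stored_id(stored):
    """Return the full identifier for a stored, possibly shortened, one."""
    if stored.startswith(":"):
        return DEFAULT_NAMESPACE + stored
    return stored


def compact_id(full):
    """Drop the default namespace name, keeping the leading ':' as marker."""
    prefix = DEFAULT_NAMESPACE + ":"
    if full.startswith(prefix):
        return full[len(DEFAULT_NAMESPACE):]
    return full


full = "FDSN:IU_COLA_00_BHZ"
short = compact_id(full)
print(short, len(full) - len(short))
```

Identifiers in any other namespace pass through both functions unchanged, so only the default namespace gets the shorter stored form.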

jfclinton commented 6 years ago

@chad-iris

For reference the current working draft of this proposal is attached: FDSN Identifiers - 2018-1-3.pdf

is there going to be an opportunity to comment on this working draft? If the consensus is to move to a single identifier that simply extends SNCL, the actual proposed extension requires significant discussion [1]. Does it fit into this stage of the discussion, or when would you see this taking place?

John

[1] e.g. Generally I agree with what is being proposed in the working draft, though I think the channel code could be extended beyond 4 characters to add some additional information about synthetic / processed data without having to scamper off to an alternative URL. For example, it would be useful to add a new data type code to indicate whether data is raw, processed or synthetic (default is raw). Then e.g. a synthetic BHN stream can be identified as BHN-X or a strong motion channel converted to acceleration can be identified as HGZ-Y. At the moment, using X as the band code loses all this information for processed / synthetic streams.
Also, since we have the opportunity, I would propose considering splitting the band code to separate out the ranges of sampling rates and the ranges of sensor corner frequencies.

chad-earthscope commented 6 years ago

@jfclinton:

is there going to be an opportunity to comment on this working draft? If the consensus is to move to a single identifier that simply extends SNCL, the actual proposed extension requires significant discussion [1]. Does it fit into this stage of the discussion, or when would you see this taking place?

Yes, absolutely. From the NGF perspective, the important part is whether we agree to this kind of identifier. If we collectively agree, then the definition of NGF can move forward (imposing, perhaps, only a maximum identifier length of 255) while the discussion of what form the extensions take is split off into a separate conversation.

I suggest one of these options:

create another space to discuss the form of the identifier and expansion of the codes independently of NGF. This could just be another GitHub project under the FDSN account.
use this issue to gauge acceptance of a SNCL-mappable, single identifier form. When the evaluation is closed at the end of the month, if member consensus is to use such a form we create another space to discuss the form of the identifier and the expansion of the codes.
use this issue to discuss single identifier form and expansion of codes.

Those are in my order of preference. I volunteer to create another GitHub project for FDSN identifiers and create issues to discuss form and expansion and rules for each of the 4 codes (network, station, location, and channel) if there is agreement to do this.

Even if the consensus is to keep 4 separate fields, one for each of the 4 codes, we need to discuss how to expand them.

Expanding the codes is a very important topic; of all the changes we are discussing, it is the one that will affect end-users the most in my opinion. It merits a separate conversation that is not muddled with the rest of the NGF details.

jfclinton commented 6 years ago

Expanding the codes is a very important topic; of all the changes we are discussing, it is the one that will affect end-users the most in my opinion. It merits a separate conversation that is not muddled with the rest of the NGF details.

I agree it's very important, and it also couples into some of the discussions in other conversations, e.g. if we agree that identification of processed data is part of the new naming convention, then we can agree and close #10, so we need to begin discussing it soon. I propose we move forward on this with Chad's first suggestion - it's going to get too messy to fold this entire topic into the single issue here.

crotwell commented 6 years ago

Can we keep it here instead of a second github? Add as many issues as you think you need, but following similar discussions arbitrarily split into 2 repositories makes it harder to follow I feel.

krischer commented 6 years ago

Can we keep it here instead of a second github? Add as many issues as you think you need, but following similar discussions arbitrarily split into 2 repositories makes it harder to follow I feel.

Done in #27-#30.

tim-iris commented 6 years ago

I think that this is a key topic and I do think that the 4 key fields that correspond to fields in miniSeed2 are still the correct ones. Since there is flexibility in how many actual characters can be used for each field, this could in general result in space savings even with the added field separators. Also, I do not think that the size of the combined identifier is that important and would not make that an issue. Life today would be simpler if the original miniSeed had not been so stingy. As time passes, lengths, bytes, and such things that relate to size become less and less important.

In general I am in favor of the time series construct proposed by Chad above

tim-iris commented 6 years ago

I think there should be some discussion related to the Channel field since it is really trying to specify three different attributes of a channel in a single field. Would it make sense to break out the current three attributes separately into band code, instrument code, and orientation? It would give greater flexibility than keeping them together as one. Users could still specify things such as BHZ but the interfaces would map those into B_H_Z, for instance, for query processing. Users might not be impacted but data generators could have greater flexibility and capability.
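As a rough sketch of that mapping, assuming a classic 3-character channel code and hypothetical helper names (`split_channel`, `expand_channel`), an interface could translate the user-facing form like this:

```python
def split_channel(channel):
    """Split a 3-character channel code into band, instrument, orientation."""
    if len(channel) != 3:
        raise ValueError("this sketch only handles 3-character channel codes")
    band, instrument, orientation = channel
    return band, instrument, orientation


def expand_channel(channel):
    """Map a user-facing code such as 'BHZ' to the separated form 'B_H_Z'."""
    return "_".join(split_channel(channel))


print(expand_channel("BHZ"))
```

Longer band or instrument codes would need the separated form as input, since a fused string could no longer be split by position.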

kaestli commented 6 years ago

I read we all agree that

we would maintain the concepts of NET, STA, LOC & CHAN to give a stream an fdsn-defined context (DESCRIPTION), while expanding each of these descriptors
for IDENTIFICATION, we would like to have a single URI-style identifier.

Now, if we define "our" FDSN URIs as FDSN:<network>_<station>_<location>_<channel>, we cannot use our NET, STA, LOC and CHAN (SNCL) descriptions and the additional flexibility of URI identifiers at the same time: if we use an FDSN: URI, it is already fully defined by the SNCLs, and if you switch to another style of URI, you cannot provide SNCL context any more (or, at least, not in a defined way).

in order to correct for this, I would propose to allow more noise on the FDSN-style URIs, e.g. by

defining FDSN stream identifiers as: FDSN:<prefix>?sncl=<network>_<station>_<location>_<channel>[&<anything>], with <prefix> = <institutional prefix>/[anything]

e.g.: FDSN:ch.ethz.sed/streams?sncl=CH_DAVOX__HHZ&version=2

"FDSN:" - fixed
"ch.ethz.sed" - standardized institutional prefix (referring to the well-established DNS standard), widely ensuring global uniqueness of identifiers even without a centralized registry
"/streams" - may be anything the institution adds for more distinctiveness, or nothing at all
"&version=2" - may be anything the institution adds for more distinctiveness, or nothing at all (both just following the URI standard, but without standardized intrinsic meaning)

crotwell commented 6 years ago

"FDSN:ch.ethz.sed/streams?sncl=CH_DAVOX__HHZ&version=2" is 53 bytes, which is bigger than the entire header for many miniseed2 records. That seems a bit excessive.

While more structure and flexibility in the identifier is a good thing, it has to be weighed against the cost of the overhead, especially since it will be repeated in every single NGF record.

kaestli commented 6 years ago

@crotwell in a legacy environment, "FDSN:ch.ethz.sed?sncl=CH_DAVOX__HHZ" would be completely sufficient (35 bytes) :-). For efficiency (in both transfer and storage), it is much more important not to be tied to small records (many headers per unit of data).
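Both the long and the short form can be taken apart with the same few steps. The following Python sketch is only an illustration of that: `parse_extended_id` and the dictionary keys it returns are hypothetical names, and the only library call used is `urllib.parse.parse_qs` from the standard library.

```python
from urllib.parse import parse_qs


def parse_extended_id(identifier):
    """Split an identifier shaped like the examples above into its pieces."""
    namespace, sep, rest = identifier.partition(":")
    if not sep or namespace != "FDSN":
        raise ValueError("expected an identifier starting with 'FDSN:'")
    prefix, _, query = rest.partition("?")
    params = parse_qs(query)  # values come back as lists
    if "sncl" not in params:
        raise ValueError("missing sncl parameter")
    institution, _, extra_path = prefix.partition("/")
    network, station, location, channel = params["sncl"][0].split("_")
    return {
        "institution": institution,
        "extra_path": extra_path,  # empty if the institution adds nothing
        "sncl": (network, station, location, channel),
        "other": {k: v[0] for k, v in params.items() if k != "sncl"},
    }


print(parse_extended_id("FDSN:ch.ethz.sed/streams?sncl=CH_DAVOX__HHZ&version=2"))
print(parse_extended_id("FDSN:ch.ethz.sed?sncl=CH_DAVOX__HHZ"))
```

In both cases the SNCL context is recovered from the `sncl` parameter alone; the institutional prefix and any extra parameters only add distinctiveness.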

I don't see any argument against typically choosing larger records (e.g. 4096 bytes, consistent with the physical sector size of many media) in stored data. For streamed data, it is important that the format allows incremental writing and (tentative) reading of records; this also relieves streaming from the requirement of very small records and many repetitions of header information (see my comments on #25). Note that in streaming, you also have TCP/IP overhead (40+ bytes in case of IPv4, 60+ bytes for IPv6) for each data unit transferred (be it a record, a fraction of a record, or, in the most extreme case, a single sample). This adds a bit of perspective to the discussion of the header size (transferred only once per record) in a streaming application.
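A quick back-of-the-envelope check of the record-size argument, as a Python sketch. The 4096-byte record length comes from the paragraph above; the 512-byte length is an assumption chosen to stand in for a "small" record.

```python
identifier = "FDSN:ch.ethz.sed/streams?sncl=CH_DAVOX__HHZ&version=2"
id_bytes = len(identifier.encode("ascii"))


def identifier_share(record_length, id_length):
    """Fraction of a record taken up by the identifier alone."""
    return id_length / record_length


# 512 is an assumed small record length; 4096 is the value quoted above.
for record_length in (512, 4096):
    share = identifier_share(record_length, id_bytes)
    print(f"{record_length}-byte record: {id_bytes} identifier bytes = {share:.1%}")
```

Under these assumptions the identifier takes roughly a tenth of a 512-byte record but only a little over one percent of a 4096-byte record.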

krischer commented 6 years ago

Summary

(Please let me know if I missed a point or misunderstood something)

There seems to be consensus on using a single time series identifier in the approximate form of FDSN:<network>_<station>_<location>_<channel>. Details of what each of these means (and whether all 4 or even more are needed) are discussed in #27-#30. This issue is purely about using a single but very flexible namespaced string identifier.

Please vote on the following issues:

Are you in favor of adopting a single string-based and namespaced time series identifier? (Yes/No)
Should the namespace field be mandatory? (Yes/No - otherwise it would default to the FDSN: namespace).
What should the maximum length of the identifier be? (255 bytes/propose other)
What should the text encoding be? (UTF-8/ASCII/propose other)

chad-earthscope commented 6 years ago

Are you in favor of adopting a single string-based and namespaced time series identifier? (Yes/No)

Yes.

Should the namespace field be mandatory? (Yes/No - otherwise it would default to the FDSN: namespace).

Yes. This is critical for providing the future ability to create other identifiers. I do not believe all FDSN identifiers need to go under the FDSN: namespace; the FDSN can create other namespaces.

What should the maximum length of the identifier be? (255 bytes/propose other)

255 bytes. The length can be stored in a single byte, and this leaves plenty of headroom for future needs.

What should the text encoding be? (UTF-8/ASCII/propose other)

The ASCII subset used for SEED 2.4 plus a few extra characters already proposed.

The transition to a URN-style identifier with a namespace puts us on the path for creating new identifiers in the future that support broader encodings, up to full UTF-8, but there are a lot of changes to systems and implications for usability if we did that now.
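A sketch of what a single-byte length prefix and an ASCII-only identifier imply in practice. The layout used here (one length byte followed by the identifier bytes) is an assumption for illustration only; the actual NGF field layout is not defined in this thread, and the function names are hypothetical.

```python
def encode_identifier(identifier):
    """Encode as <1-byte length><ASCII bytes> (assumed layout)."""
    raw = identifier.encode("ascii")  # raises UnicodeEncodeError on non-ASCII
    if len(raw) > 255:
        raise ValueError("identifier longer than 255 bytes")
    return bytes([len(raw)]) + raw


def decode_identifier(buffer):
    """Read one length-prefixed identifier from the start of a buffer."""
    length = buffer[0]
    return buffer[1:1 + length].decode("ascii")


encoded = encode_identifier("FDSN:IU_COLA_00_BHZ")
print(len(encoded), decode_identifier(encoded))
```

A single length byte is what caps the identifier at 255 bytes; a longer limit would need a wider length field.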

krischer commented 6 years ago

The transition to a URN-style identifier with a namespace puts us on the path for creating new identifiers in the future that support broader encodings, up to full UTF-8, but there are a lot of changes to systems and implications for usability if we did that now.

This is to some degree independent of the format. Each namespace (depending on how we do it) could still allow only a subset of what the format itself can store. But if we choose anything "less" than a UTF variant, we limit the format, and moving to a UTF encoding would require a new revision of the core data format.

chad-earthscope commented 6 years ago

The transition to a URN-style identifier with a namespace puts us on the path for creating new identifiers in the future that support broader encodings, up to full UTF-8, but there are a lot of changes to systems and implications for usability if we did that now.

This is to some degree independent of the format. Each namespace (depending on how we do it) could still allow only a subset of what the format itself can store. But if we choose anything "less" than a UTF variant, we limit the format, and moving to a UTF encoding would require a new revision of the core data format.

I would think we could define a new identifier type and use it with the same core format, just like we can define a new encoding type and not change the core format. For example, a "FDSN-U8:" namespace could be created in the future to have some kind of identifiers that allow UTF-8; the format would not need to change. Just like with encodings, it's the readers that need to support those new variations.
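A minimal sketch of that idea from a reader's point of view, assuming the namespace is everything before the first ':'. The registry `PARSERS`, the function names, and the behaviour of the hypothetical "FDSN-U8" parser are all made up for illustration.

```python
def parse_fdsn(body):
    """Parser for the FDSN: namespace (network_station_location_channel)."""
    network, station, location, channel = body.split("_")
    return {"network": network, "station": station,
            "location": location, "channel": channel}


# Registry of known namespaces; a reader adds entries as new schemes appear.
PARSERS = {"FDSN": parse_fdsn}


def parse_identifier(identifier):
    """Dispatch on the namespace without touching the storage format."""
    namespace, sep, body = identifier.partition(":")
    if not sep:
        raise ValueError("identifier has no namespace")
    try:
        parser = PARSERS[namespace]
    except KeyError:
        raise ValueError(f"unsupported namespace: {namespace}") from None
    return namespace, parser(body)


# Hypothetical future namespace: supporting it only changes the reader.
PARSERS["FDSN-U8"] = lambda body: {"name": body}

print(parse_identifier("FDSN:IU_COLA_00_BHZ"))
print(parse_identifier("FDSN-U8:Zürich"))
```

A reader that does not know a namespace can still carry the identifier as an opaque string; only interpreting it requires a registered parser.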

krischer commented 6 years ago

This works as long as the namespace itself is limited to some defined text encoding. But this is likely only an academic and not a practical problem.

kaestli commented 6 years ago

1 - YES
2 - YES
3 - 64k (2-byte length indicator), with a recommendation to stick to 2k. Rationale: with a 1-byte length (255 chars) many other URI schemes, e.g. http, cannot be leveraged.
4 - ASCII with %-escaping (follow section 2 of IETF RFC 3986 on URIs - see https://tools.ietf.org/html/rfc3986#section-2.1)
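For illustration, %-escaping in the style of RFC 3986 can be sketched with `quote` and `unquote` from Python's standard `urllib.parse` module. The station name below is made up, and keeping ':' and '_' unescaped is an assumption about which delimiters an identifier scheme would want to preserve.

```python
from urllib.parse import quote, unquote


def to_ascii_id(text):
    """Percent-encode non-ASCII and reserved characters, keeping ':' and '_'."""
    return quote(text, safe=":_")


def from_ascii_id(escaped):
    """Reverse the percent-encoding, decoding the bytes as UTF-8."""
    return unquote(escaped)


original = "FDSN:CH_ZÜRICH__HHZ"  # made-up station name with a non-ASCII letter
escaped = to_ascii_id(original)
print(escaped, escaped.isascii(), from_ascii_id(escaped) == original)
```

The stored form stays pure ASCII, at the cost of several bytes for each escaped character.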

crotwell commented 6 years ago

1 - yes
2 - yes, although I feel that 99% of use will be fairly standard SNCL, so this effectively wastes 5 bytes per record
3 - 255
4 - UTF-8, but I think given the character limitations in #27-#30 this is effectively the same as ASCII while leaving some flexibility for the future. I would also say defer to existing standards wherever there is conflict, i.e. RFC 3986.