FDSN / miniSEED3

https://docs.fdsn.org/projects/miniseed3/

Overall record size #17

Closed: djeastonca closed this issue 2 years ago

djeastonca commented 2 years ago

In most cases, evolving a data format makes an increase in space requirements inevitable; at the same time, it is recognized that storage density continues to increase and the corresponding costs continue to decrease. However, there is still a cost, and it is borne by the low-power dataloggers producing the data, the data centers storing and distributing it, the computers consuming it, and the networks transmitting it between all of these. A technical review of the proposed data format is an opportunity to step back and consider whether the proposal is cost effective overall - in other words, whether the level of cost optimization undertaken in the design of the new data format is suitable.

The space requirements associated with several aspects of the new data format can be considered via a few examples:

In the case of the Source Identifier field, the need to expand the namespace is clear; the approach taken here is mostly consistent with today's conventions; and the benefits of an expanded namespace will be leveraged constantly going forward.
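For illustration, here is a minimal sketch of how a legacy SEED network/station/location/channel code maps into the expanded Source Identifier namespace. The mapping is a paraphrase of the FDSN Source Identifiers convention (docs.fdsn.org/projects/source-identifiers), not an authoritative implementation:

```python
# Hedged sketch: mapping a miniSEED 2 style SEED code to an FDSN Source
# Identifier as used by miniSEED3. The three characters of the legacy
# channel code become separate band, source, and subsource codes, each
# of which may be longer than one character in the new namespace.

def to_source_id(net: str, sta: str, loc: str, chan: str) -> str:
    band, source, subsource = chan        # e.g. "BHZ" -> "B", "H", "Z"
    return f"FDSN:{net}_{sta}_{loc}_{band}_{source}_{subsource}"

print(to_source_id("IU", "ANMO", "00", "BHZ"))  # FDSN:IU_ANMO_00_B_H_Z
```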

The original arguments put forth in support of moving to an 8-byte time representation are compelling, and in the years since, momentum has continued to build toward dispensing with new leap seconds entirely; part of the rationale is a recognition that the world is unprepared for a possible negative leap second ("What could possibly go wrong?" ;-))

With regard to the remaining two examples cited: is too much weight being placed on possible edge cases to justify a 4 GB record length and the very high level of precision offered by a 64-bit float for the sample rate? It is hard to predict the community's needs over the fullness of time, but the new data format is designed to be more easily extended in the future as significant new use cases emerge.

crotwell commented 2 years ago

My memory is that the 8-byte sample rate was desired by the OBS community, as they often have very long time series with a GPS time stamp only at the beginning, before deployment, and at the end, when the instrument pops back up to the surface. They argued that the precision of a 4-byte sample rate would introduce errors over a deployment cycle of many months to a year. I have no direct knowledge of how timing on OBS systems works, so I can't judge that need, but if they say they need higher precision, I am inclined to say the extra 4 bytes are worth it.
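To make that concrete, here is a back-of-the-envelope sketch of the timestamp error a 4-byte sample rate can accumulate over a year. The 100.0001 Hz rate is made up, and this models only representation error, not instrument clock drift:

```python
import numpy as np

# How much timestamp error does storing the sample rate as a 4-byte
# float introduce over a year-long deployment? (Illustrative only.)
nominal = 100.0001                  # Hz; a made-up, non-round true rate
rate_f32 = np.float32(nominal)      # rate as a 4-byte float
rate_f64 = np.float64(nominal)      # rate as an 8-byte float

deployment_s = 365 * 86400          # one year, in seconds
n = round(deployment_s * nominal)   # samples recorded over the deployment

# Timestamp of the final sample computed from each representation; the
# difference works out to roughly a quarter of a second here, i.e. dozens
# of sample intervals at 100 Hz.
t32 = n / float(rate_f32)
t64 = n / float(rate_f64)
print(f"accumulated timestamp error: {abs(t32 - t64):.3f} s")
```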

There were several comments early in the mseed3 development arguing strongly for a year-day-hour-min-sec style time instead of a simple offset from a "zero time". Personally, I felt that the arguments against an 8-byte double offset were not all that convincing, but the extra 4 bytes also did not seem that much of a waste. There is, I suppose, some advantage in not having to count leap seconds to decide what time the double value means.
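A small sketch of the two representations under discussion. The calendar-style field layout loosely follows the miniSEED3 draft (nanosecond, year, day-of-year, hour, minute, second), but the exact types and byte order here are illustrative assumptions, not spec quotes:

```python
import struct
from datetime import datetime, timezone

# An instant right before the 2016-12-31 leap second.
t = datetime(2016, 12, 31, 23, 59, 59, tzinfo=timezone.utc)

# (a) Calendar-style fields: self-describing; no leap-second table is
# needed to know which wall-clock second is meant, and a :60 leap
# second can be carried directly in the seconds field.
calendar = struct.pack("<IHHBBB", 0, t.year, t.timetuple().tm_yday,
                       t.hour, t.minute, t.second)

# (b) Epoch offset: a single number, but POSIX time has no distinct
# value for 2016-12-31T23:59:60, so a reader must count leap seconds
# to interpret the offset unambiguously.
offset_ns = int(t.timestamp()) * 1_000_000_000

print(len(calendar), "bytes of calendar fields")   # 11
print(offset_ns, "as an 8-byte offset")
```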

I still feel the argument for allowing larger records, particularly on the research/postprocessing side of seismology, is compelling. There is just a lot of advantage in being able to save a single contiguous time series as a single record when doing research. That said, I would fully expect that data loggers would only generate small records, and data centers would probably refuse to accept records above a given size, much as they do now. I would also note that in miniseed2 a record can currently be as large as 2 to the 255th power bytes (the length is stored as a power of two in a single byte), which is just crazy big. The extra 2 bytes feel worth it to me. Also, while probably not a strong argument, the current layout keeps word alignment. My understanding is that that is no longer a big deal for modern systems, but it is kind of nice.
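As a rough illustration of why a 2-byte length field is limiting for the research use case (assuming uncompressed 32-bit samples; numbers are illustrative only):

```python
# Can one day of continuous 100 Hz data be stored as a single record?
samples = 100 * 86400            # one day at 100 samples/s
payload = samples * 4            # bytes, assuming uncompressed 32-bit ints

print(f"payload: {payload:,} bytes")                   # 34,560,000 bytes
print("fits 16-bit length field:", payload < 2**16)    # False (64 KiB cap)
print("fits 32-bit length field:", payload < 2**32)    # True (~4 GiB cap)
```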

Lastly, I have long been of the opinion that mseed3 should NOT be optimized for low-latency, near-real-time, earthquake-early-warning style data delivery. That seems to me to be a very specialized use case that really should have its own specialized protocol, one tuned not only to the data logger but also to the transmission technology, and designed to minimize redundant data across the many simultaneously recorded channels. While smaller is better in general, I guess I just don't see a few bytes in the header as being that significant for any other use case.

My $0.02 for whatever it is worth...

djeastonca commented 2 years ago

The extra storage needed for any individual field is modest, but in the aggregate these fields make a noticeable difference to overhead as the number of record headers in a given file/archive accumulates. I believe that a technical review of the proposed design should recognize that the design is not as lean as it could be, while also summarizing the benefits identified by the community, so that the trade-offs are clear.

@crotwell I fully agree with your opinion about the applicability of the mseed3 design to applications beyond archiving (e.g. low latency/EEW). The motivation for raising this issue is only to ensure that we recognize from a technical perspective that there is an impact on storage costs for the dominant use case associated with mseed3 - storage/archiving.

chad-earthscope commented 2 years ago

I think you both have covered this aspect well: recognition that it's not as lean as possible, and the rationale for the balance being struck. On the one hand the fixed header is only the same size as a TCP header; on the other, when you've got a lot of them (100s of billions in the IRIS archive) it really adds up!
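For a sense of scale, a hedged bit of arithmetic: the ~10 extra bytes per header below simply sums the examples discussed in this thread and is an assumption, not a spec-derived figure.

```python
# Aggregate cost of the "extra" header bytes at archive scale.
extra_per_header = 4 + 4 + 2   # float64 rate, time fields, 4-byte length
records = 300e9                # "100s of billions", per the comment above

print(f"~{extra_per_header * records / 1e12:.0f} TB of overhead")  # ~3 TB
```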

From a design perspective, one case that was not a primary target but could be useful, for low latency/EEW and extended data types, is multiplexed payloads. I'm hopeful we can still get there with alternate identifier schemes and payload definitions.