DAS-RCN / RCN_DASformat


Appeal to reconsider time data type #1

Closed rcasey-earthscope closed 1 year ago

rcasey-earthscope commented 1 year ago

Thank you for sharing this draft of the IRIS DAS format. It is exciting to see this taking shape. I saw something right off in the README that gave me pause, as I have seen this happen in other code to not-so-good effect in time series data.

t0: UNIX time stamp of first sample in file, type=float64
dt: Spacing between samples in seconds (i.e. inverse of the sampling rate), type=float32

Storing time as floats, even doubles, is a recipe for big problems. The biggest issue with floats is that they cannot sustain precision with increasingly large values. Even at millisecond levels, you begin to see a degradation in precision, and with DAS, we will need at minimum nanosecond precision. I would argue for picosecond precision if at all possible.
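To put rough numbers on this, here is a minimal sketch that measures the gap between adjacent representable values near a present-day UNIX timestamp. The timestamp value and the helper `to_float32` are illustrative assumptions, not part of the format draft, and `math.ulp` needs Python 3.9 or later.

```python
import math
import struct

# Hypothetical UNIX timestamp in seconds (late 2023), chosen for illustration.
T0_SECONDS = 1.7e9

def to_float32(value: float) -> float:
    """Round a Python float (float64) to the nearest float32 and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

# Gap between adjacent float64 values at this magnitude: 2**-22 s, about 238 ns.
float64_step = math.ulp(T0_SECONDS)

# float32 has a 24-bit significand, so near 1.7e9 adjacent values are 128 s apart.
rounds_down = to_float32(T0_SECONDS + 60.0)   # collapses back to T0_SECONDS
rounds_up = to_float32(T0_SECONDS + 100.0)    # jumps to T0_SECONDS + 128

# A float32 dt cannot hold 1 ms exactly either.
dt_error = abs(to_float32(0.001) - 0.001)

print(f"float64 step near t0: {float64_step * 1e9:.1f} ns")
print(f"float32 dt error for 1 ms: {dt_error:.3e} s")
```

Under these assumptions a float64 t0 cannot resolve individual nanoseconds, and a float32 would not even resolve individual minutes.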

You have to go with long integers (int64) as the solution to this problem. In addition, you will want to address the time range that a 64 bit encoding can cover at your chosen level of precision so it will stand the test of time. This first requires that you consider what you want to consider zero-date. Since DAS didn't exist in 1970, there is no need to have that be the start date. Also, knowing where the earliest DAS data could begin means that you can make your long integer unsigned, which is a huge benefit for range.
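The range trade-off can be checked with a little arithmetic. This is only a sketch: the helper `span_seconds` is hypothetical, and a year is taken as 365.25 days.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

def span_seconds(bits: int, ticks_per_second: int) -> float:
    """Seconds covered by an unsigned counter with the given bit width and tick rate."""
    return 2 ** bits / ticks_per_second

# uint64 counting nanoseconds: roughly 584.5 years from the chosen zero-date.
uint64_ns_years = span_seconds(64, 10 ** 9) / SECONDS_PER_YEAR

# int64 spends one bit on the sign: roughly 292 years either side of zero.
int64_ns_years = span_seconds(63, 10 ** 9) / SECONDS_PER_YEAR

# uint64 counting picoseconds: only about 213.5 days in total.
uint64_ps_days = span_seconds(64, 10 ** 12) / SECONDS_PER_DAY

print(f"uint64 ns: {uint64_ns_years:.1f} years")
print(f"int64  ns: +/- {int64_ns_years:.1f} years")
print(f"uint64 ps: {uint64_ps_days:.1f} days")
```

So nanoseconds fit comfortably in 64 bits, while picoseconds would need either a wider integer or a split representation.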

You also can't compare floats easily, but you can always compare integers. Integer operations are also more efficient on CPUs. However, the most important benefit is your step-wise precision from sample to sample. Floats will lose in this regard, resulting in dropped samples or incorrect cut offsets.
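As a sketch of the comparison and cut-offset point, assume a hypothetical start time and a 1 kHz sampling interval, both held as integer nanoseconds. None of these names come from the draft.

```python
# Hypothetical values for illustration only.
T0_NS = 1_700_000_000_000_000_000   # start time, nanoseconds since the UNIX epoch
DT_NS = 1_000_000                   # sample spacing: 1 ms, i.e. 1 kHz

def sample_time_ns(index: int) -> int:
    """Exact time of sample `index`; integer arithmetic never rounds."""
    return T0_NS + index * DT_NS

def sample_index(time_ns: int) -> int:
    """Index of the last sample at or before `time_ns` (an exact cut offset)."""
    return (time_ns - T0_NS) // DT_NS

# Two instants 100 ns apart are distinct as integers...
t_a = sample_time_ns(0)
t_b = t_a + 100

# ...but become the same value once expressed as float64 seconds,
# because adjacent float64 values near 1.7e9 are about 238 ns apart.
floats_collide = (t_a / 1e9) == (t_b / 1e9)

print(t_a != t_b, floats_collide)
print(sample_index(sample_time_ns(123_456)))
```

With integers, a time converts to an index and back without drift, which is what a cut operation needs.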

But don't take it from me; there is a good article here that describes the problem further. Maybe I am misinterpreting the use of t0 and dt, since I haven't reviewed the code yet, but I wanted to call this out. Thanks!

https://randomascii.wordpress.com/2012/02/13/dont-store-that-in-a-float/

-Rob

andreas-wuestefeld commented 1 year ago

Thanks Rob,

Good point. So you suggest following the numpy.datetime64 approach of uint64 in nanoseconds? I would still probably stick to UNIX time-zero, just to avoid introducing yet another time base :-)

rcasey-earthscope commented 1 year ago

datetime64 is a good basis to go off of. I did briefly look at the code and saw the creation of t0 in the example from datetime.timestamp(), which returns a float much like time.time(). There is a remedy for nanosecond precision in time.time_ns() that returns an int.
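A minimal sketch of both routes to an integer t0 follows. The helper `datetime_to_ns` and the example date are assumptions for illustration; the helper expects a timezone-aware datetime and avoids the float returned by `datetime.timestamp()`.

```python
import time
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def datetime_to_ns(moment: datetime) -> int:
    """Convert a timezone-aware datetime to integer nanoseconds since the UNIX epoch.

    timedelta // timedelta yields an exact int, so no float is involved.
    datetime only carries microseconds, hence the final factor of 1000.
    """
    return ((moment - EPOCH) // timedelta(microseconds=1)) * 1000

# Current wall-clock time as an int, without passing through a float.
now_ns = time.time_ns()

# Hypothetical first-sample time, for illustration.
first_sample = datetime(2023, 1, 1, 0, 0, 0, 123456, tzinfo=timezone.utc)
t0_ns = datetime_to_ns(first_sample)

print(t0_ns)
```

Note that a naive datetime (no tzinfo) would raise a TypeError in the subtraction, which is arguably a useful guard against ambiguous local times.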

https://docs.python.org/3/library/time.html#time.time

However, datetime64 is certainly what I would recommend if not going with a uint64.

Thanks for writing back!

andreas-wuestefeld commented 1 year ago

do you have an opinion on dt vs fsamp?

rcasey-earthscope commented 1 year ago

provided in your Issue #2

miili commented 1 year ago

In general I agree with using float64. But I oppose np.datetime64, as the format would be limited to the Python world.

Consider using ISO8601 dates https://en.wikipedia.org/wiki/ISO_8601. The standard says:

If necessary for a particular application, the standard supports the addition of a decimal fraction to the smallest time value in the representation.

Comparisons on those data types are awful, however, so opting for a float in Unix epoch is not the worst idea.

As for sampling_period or sampling_frequency one can consider using float128, then a potential format would cover kHz/MHz applications.

rcasey-earthscope commented 1 year ago

Thank you for pointing out the Python lock-in, miili. datetime64 is great for Python code, but would probably not work well as a portable representation. Java, C, Rust, Go, JavaScript, what have you: all of these languages will get used.

float128 is non-conventional and probably too large. Using int64 or uint64 strikes the right balance for precision, range, and size. I think floats should be left to calculations where a given precision level is acceptable, and not used as a minted representation in a data format when it comes to time (or other large, high-precision measurements).

andreas-wuestefeld commented 1 year ago

t0 is now a uint64 in nanoseconds.
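For readers in other languages, a uint64 is just eight bytes, which any of them can read. The sketch below is an assumption-laden illustration: the little-endian byte order and the function names are hypothetical, and the thread does not specify how the format actually stores the value on disk.

```python
import struct

# "<Q" means little-endian unsigned 64-bit integer; the byte order is an assumption.
UINT64_LE = "<Q"

def encode_t0(t0_ns: int) -> bytes:
    """Pack a nanosecond timestamp into 8 bytes; raises struct.error if out of range."""
    return struct.pack(UINT64_LE, t0_ns)

def decode_t0(raw: bytes) -> int:
    """Unpack 8 bytes back into an integer nanosecond timestamp."""
    return struct.unpack(UINT64_LE, raw)[0]

t0_ns = 1_700_000_000_123_456_789   # hypothetical value for illustration
raw = encode_t0(t0_ns)
restored = decode_t0(raw)

print(len(raw), restored == t0_ns)
```

The round trip is exact to the last nanosecond, which a float64 in seconds could not guarantee at this magnitude.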

andreas-wuestefeld commented 1 year ago

Thanks Miili for pointing out ISO formats. I saw it only now, after working on the uint64 implementation. I feel uint64 is still language agnostic and avoids issues with strings. And since there are many options within the ISO format, that may be confusing.

rcasey-earthscope commented 1 year ago

I consider this issue closed. Thank you.