PeytonT / serde_ion

Rust implementation of serialization and deserialization of the Ion data format
Other
6 stars 0 forks source link

Timestamp representation/validation #12

Open siler opened 4 years ago

siler commented 4 years ago

Hiya.

We'll need to figure out how to deal with Ion timestamps. Here's the simple definition of an Ion timestamp from the spec:

Timestamps represent a specific moment in time, always include a local offset, and are capable of arbitrary precision.

Arbitrary precision is the interesting part. A valid Ion timestamp can be a year (e.g. 2007T), at minimum precision. There is no maximum precision for fractional seconds.

The set of requirements look something like this:

Goals

Ideally, we get out of the way when dealing with timestamps. Also ideally, people tend to use certain ISO 8601 serializations (that is, I would suggest that second and millisecond are probably the most common precisions, though I don't have data to back this up).

I would like to provide a clean way for clients of the library to extricate timestamps to their favorite crate. However, this is a non-trivial problem as no crate currently provides arbitrary precision timestamps so the behavior for how this should be done is largely up for debate.

A non-goal is fully integrating a date/time library (though it could be?).

Approach

It feels beyond the scope of this library to implement a fully functioning arbitrary precision timestamp toolkit, so I think we should try to pick a crate that can do most of the heavy lifting for us if possible. Unfortunately, a crate for that use case doesn't exist yet. We can extract parts that are useful (such as the Date and UTC offset in #11).

We could go further and use the Time and DateTime types internally, however this is where arbitrary precision begins to rear its ugly head. It might be possible to work around this where precisions match types available in libraries, but the question is what do we do when they don't? A trait might be a good solution here, allowing the client to define the binding to the timestamp library.

Date and Time Crates

After a brief survey of the date and time crates, there are only a couple of options:

Both seem to be maintained and frequently used. Chrono is a superset of time (i.e. it depends on time and adds an extended API as well as some other features). Chrono also includes a list of limitations.

The chief limitation seems to be arbitrary precision fractional seconds. Another potential limitation, depending on if the requirements are for a four digit year or an arbitrary precision year, is the +-262,000 year limitation. I'm not sure that this matters if we are only using the library for validation and internal storage of a UTC offset.

That's probably enough thoughts on timestamps for now.

PeytonT commented 4 years ago

Ideally, we get out of the way when dealing with timestamps.

Agree. Time is weird and hard and Ion is a data format, so ideally we just surface correct data and let the user figure it out.

Also ideally, people tend to use certain ISO 8601 serializations (that is, I would suggest that second and millisecond are probably the most common precisions, though I don't have data to back this up).

This matches my experience.

I would like to provide a clean way for clients of the library to extricate timestamps to their favorite crate. However, this is a non-trivial problem as no crate currently provides arbitrary precision timestamps so the behavior for how this should be done is largely up for debate.

I think we could support this with crate features for TryInto impls between Timestamp and the significant time crates. Arbitrary precision is thorny, but we could just have the TryInto fail if the target library doesn't have a type with matching precision.

A non-goal is fully integrating a date/time library (though it could be?).

Probably not, given the non-overlapping behavior between Timestamp and chrono/time.

It feels beyond the scope of this library to implement a fully functioning arbitrary precision timestamp toolkit, so I think we should try to pick a crate that can do most of the heavy lifting for us if possible.

Very much agree.

We could go further and use the Time and DateTime types internally, however this is where arbitrary precision begins to rear its ugly head. It might be possible to work around this where precisions match types available in libraries, but the question is what do we do when they don't?

I think the best bet is probably to check as much as we can. Timestamp doesn't respect leap seconds, so I think all of the validation that might actually cause use to reject a Timestamp can be done with Date/Time.

The chief limitation seems to be arbitrary precision fractional seconds.

Since fractional seconds can't cause the Timestamp to be invalid I don't think we have to worry about this, aside from the fact that it prevents us from defining Timestamp as a wrapper around Date and Time.

Another potential limitation, depending on if the requirements are for a four digit year or an arbitrary precision year, is the +-262,000 year limitation. I'm not sure that this matters if we are only using the library for validation and internal storage of a UTC offset.

I believe this is not an issue. Notably, Ion doesn't support moments in time prior to Jan 01, 0001, so we don't have to worry about the negative side of things. Also, by construction from these two clauses we can deduce that Ion doesn't support years past 9999.

  1. All three views are semantically isomorphic, meaning they can represent exactly the same data structures, and an Ion processor can transcode between the formats without loss of data.
  2. In the text format, timestamps follow the W3C note on date and time formats
PeytonT commented 4 years ago

This is actually the first time I puzzled through the valid Year values for Timestamp, and it leaves me pondering why, given that this number is restricted to the range [1, 9999], it is represented in the binary encoding with a VarUint? I suppose it allows years in the range [1, 127] to be represented with only 1 byte.

siler commented 4 years ago

I think we could support this with crate features for TryInto impls between Timestamp and the significant time crates. Arbitrary precision is thorny, but we could just have the TryInto fail if the target library doesn't have a type with matching precision.

Yeah, that seems reasonable.

We could go further and use the Time and DateTime types internally, however this is where arbitrary precision begins to rear its ugly head. It might be possible to work around this where precisions match types available in libraries, but the question is what do we do when they don't?

I think the best bet is probably to check as much as we can. Timestamp doesn't respect leap seconds, so I think all of the validation that might actually cause use to reject a Timestamp can be done with Date/Time.

The chief limitation seems to be arbitrary precision fractional seconds.

Since fractional seconds can't cause the Timestamp to be invalid I don't think we have to worry about this, aside from the fact that it prevents us from defining Timestamp as a wrapper around Date and Time.

I agree on checking as much as we can.

As far as internal representation goes, we'll have to create the time objects anyway to check Dates and Times, however I think the internal representation would be nicer as a simple data object as originally existed and like the idea of a TryInto for chrono's DateTime. This is safer (from a support standpoint), and would be ideally used as a feature flag (on by default?). That way we can also use time internally for validation which is a little more lightweight. If we go this approach, I would likely suggest using a i32 for offset, which is what UtcOffset uses internally, so we don't expose time on our public API.

Do we want to pare the representation sizes down to what can be represented within the timestamp spec/what time uses?

Also, how did you feel about the more hierarchical type structure (TextTimestamp)?

PeytonT commented 4 years ago

This is safer (from a support standpoint), and would be ideally used as a feature flag (on by default?).

SGTM

I would likely suggest using a i32 for offset, which is what UtcOffset uses internally, so we don't expose time on our public API.

Agreed.

Do we want to pare the representation sizes down to what can be represented within the timestamp spec/what time uses?

Yes, a lot of things get simpler now that I know from reading the text specification that years are bounded in the range (1, 9999). The Timestamp enum as currently defined just matches the raw structure of binary encoded timestamps, but that was not intended to stick around once timestamps were validated. I've been sandbagging changes to avoid merge conflicts for the large PR that's outstanding.

Also, how did you feel about the more hierarchical type structure (TextTimestamp)?

I don't think that the variants of Value should manifest aspects of the text/binary encoding, since they are meant to represent the pure data model. (Ideally invalid data like bad timestamps would be unrepresentable in the data model types we provide, but this is tricky without a lot of effort or better const support.) But I don't think there's any problem with defining other types like TextTimestamp elsewhere, and providing From impls as a convenience to users.

siler commented 4 years ago

Do we want to pare the representation sizes down to what can be represented within the timestamp spec/what time uses?

Yes, a lot of things get simpler now that I know from reading the text specification that years are bounded in the range (1, 9999). The Timestamp enum as currently defined just matches the raw structure of binary encoded timestamps, but that was not intended to stick around once timestamps were validated. I've been sandbagging changes to avoid merge conflicts for the large PR that's outstanding.

Cool, I've added an issue: #14.

Also, how did you feel about the more hierarchical type structure (TextTimestamp)?

I don't think that the variants of Value should manifest aspects of the text/binary encoding, since they are meant to represent the pure data model. (Ideally invalid data like bad timestamps would be unrepresentable in the data model types we provide, but this is tricky without a lot of effort or better const support.) But I don't think there's any problem with defining other types like TextTimestamp elsewhere, and providing From impls as a convenience to users.

Apologies for being unclear, I meant as a contrast to the flat layout of Timestamp.

PeytonT commented 4 years ago

as a contrast to the flat layout of Timestamp

I prefer the flat layout for Timestamp, since it matches the pure data model, but I suspect that it would be cumbersome to write application code with it, since there are so many enum variants and so many fields in some of the variants. I think the Date/Time types look useful for the end-user.

PeytonT commented 3 years ago

Issues around timestamp representation cropped up again recently.