beacon-biosignals / TimeSpans.jl

A Julia package that provides a `TimeSpan` type for representing a continuous span between two points in time.
Other
6 stars 2 forks source link

user-selectable rounding strategies for time <-> index conversion? #26

Closed ericphanson closed 1 year ago

ericphanson commented 2 years ago

Consider a signal with sample rate 1 Hz.

Index       1   2   3   4   5
Time (s)    0   1   2   3   4

Now, consider the time span 1.5s to 2.5s. Using brackets to highlight this span:

Index       1   2     3     4   5
Time (s)    0   1  [  2  ]  3   4

What indicies should be associated to this span?

To me, looks like index 3.

But:

julia> TimeSpans.index_from_time(1, TimeSpan(Millisecond(1500), Millisecond(2500)))
2:2
kleinschmidt commented 2 years ago

The docs are pretty clear on this at least for the start index:

integer index of the most recent sample taken at sample_time

So at time 1.5s, the most recent sample taken at that time is index 2.

At tiem 2.5s the most recent sample is index 3. I guess then the right-closed definition gives you 2:2, maybe that's where the issue is?

kleinschmidt commented 2 years ago

I think the solution might be to convert the final index back to time, and see if it needs to be adjusted down by 1 (e.g., if time_from_index(rate, index_from_time(rate, stop(span))) == stop(span) then you use j-1, otherwise use j

hannahilea commented 2 years ago

Related to https://github.com/beacon-biosignals/TimeSpans.jl/issues/20.

@ericphanson and I discussed this irl, and determined that, similar to #20, this is an example of pre-specified rounding behavior (in https://github.com/beacon-biosignals/TimeSpans.jl/blob/eb8b7949b3bc4b82c7f176f3b5227a48c08f4dce/src/TimeSpans.jl#L215, https://github.com/beacon-biosignals/TimeSpans.jl/blob/eb8b7949b3bc4b82c7f176f3b5227a48c08f4dce/src/TimeSpans.jl#L237) of times that do not line up exactly with index times.

We propose that we disallow calling index_from_time on any ::Period that does not fall exactly on a sample. We would then provide an additional set of round_to_sample_time(sample_rate, ::Union{Period, TimeSpan}, mode) helper functions that, given a time (or span) and rounding mode, would return time(s) that fall on exact samples. Possible desired rounding options:

In this new paradigm, current invocations of index_from_time(sample_rate, span) would need to be updated to some flavor of

index_from_time(sample_rate, round_to_sample_time(sample_rate, span, mode)).

We could then provide a deprecation path to the current index_from_time functions that falls back to a round_to_sample_time that persists the current behavior.

hannahilea commented 2 years ago

From further discussion w @ericphanson, @kleinschmidt , one reason that we have to be careful with deprecation paths is that we use this all over the place when indexing into Onda.Samples via spans rather than sample indices.

kleinschmidt commented 2 years ago

After some discussion with @hannahilea and @ericphanson, we concluded that we can handle this behavior with something like the "rounding mode" behavior of Base.round, and that any default rounding mode is a major footgun for something as central to the Ondaverse as this is. One option would be to warn or throw an error when there's not an exact alignment with a span endpoint and sample times (the source of this behavior). That has two main disadvantages: it may slow down what should be a fast code path, and it will likely break between test cases and actual data "in the wild" (e.g. distributed jobs with large training batches, data from automated pipelines with varying sampling rates, etc.).

Another issue with a default rounding mode is that there isn't a rounding mode that is going to make sense or obey folks' intuitions in all cases. If you use the "intersect with any sample period" rounding mode (my proposal above), you will get spans of different lengths depending on whether you're exactly aligned with span times. If you round to a consistent number of samples, you will sometimes omit final (or initial) samples, so the interpretation of the timespan as an interval fails. Same (on both counts) for rounding up/down/in/out.

So it seems worthwhile to tradeoff some convenience (being able to do things like Samples[:, Timespan(...)]) for being explicit about how non-exact span-segment alignment should be handled (e.g., requiring that you write Samples[:, TimeSpan(...); round=Down] or samples[:, TimeSpan(...), Down])

jrevels commented 2 years ago

Another issue with a default rounding mode is that there isn't a rounding mode that is going to make sense or obey folks' intuitions in all cases. If you use the "intersect with any sample period" rounding mode (my proposal above), you will get spans of different lengths depending on whether you're exactly aligned with span times. If you round to a consistent number of samples, you will sometimes omit final (or initial) samples, so the interpretation of the timespan as an interval fails. Same (on both counts) for rounding up/down/in/out.

💯

Yup, it's all about just picking whatever strategy lines up with your expectations about which invariants/properties you want to preserve. For simplicity's sake, TimeSpans just demands callers to align their expectations with the strategy it happens to implement 😅 not fun for callers who might desire a different interpretation but it is at least self-consistent

worth noting #2, in which case usages of TimeSpan would be replaced in user code by Interval{<:Period,Closed,Open}. I think if we wanted to solve this particular API problem for time <-> index conversion it'd be more worthwhile to do it atop Intervals than atop TimeSpans

I also wonder if more concretely baking the "interpretation choice" into these kinds of structures would also help, e.g. having callers use Period ranges like Millisecond(start):Millisecond(period):Millisecond(stop) and then indexing into/from those structures. or maybe it just punts the problem

ericphanson commented 2 years ago

I think the retitling doesn't quite capture everything here--- I see it as three issues that came up here:

1. Docs issue

The current implementation of index_from_time(sample_rate, span::TimeSpan) makes an undocumented choice of semantics, which is to round-down the endpoints, and ensure the right endpoint of the span is excluded.

What do I mean this is undocumented? Well,

integer index of the most recent sample taken at sample_time

documents the index_from_time(sample_rate, time) method. That isn't sufficient to learn the semantics of the span method, since we do something other than simply call index_from_time on each endpoint, re

https://github.com/beacon-biosignals/TimeSpans.jl/blob/eb8b7949b3bc4b82c7f176f3b5227a48c08f4dce/src/TimeSpans.jl#L237

and we only document the span method as

https://github.com/beacon-biosignals/TimeSpans.jl/blob/eb8b7949b3bc4b82c7f176f3b5227a48c08f4dce/src/TimeSpans.jl#L221

which leaves it unclear exactly what UnitRange we're talking about.

2. Implementation bug

The implementation of those semantics isn't correct when the right endpoint of the span doesn't fall exactly on a sample; we always throw away the right-endpoint even when it is well within the span. @kleinschmidt's suggestion

I think the solution might be to convert the final index back to time, and see if it needs to be adjusted down by 1 (e.g., if time_from_index(rate, index_from_time(rate, stop(span))) == stop(span) then you use j-1, otherwise use j

would fix that, I think.

3. Suggestion to change the semantics.

ref https://github.com/beacon-biosignals/TimeSpans.jl/issues/26#issuecomment-1016887476, https://github.com/beacon-biosignals/TimeSpans.jl/issues/26#issuecomment-1016914898

To add to those-- I think there's too much magic right now: index_from_time is used internally by indexing into Onda.Samples and other places, and it's not clear on that level what semantics you are getting when you do samples[:, span]. (The Onda docs point to the tour, which doesn't discuss rounding). This indexing is convenient until something breaks and it turns out you would've been far better off carefully thinking about rounding semantics when you wrote the code originally. So I think we not round automatically, and either force the user to pass spans whose endpoints are divisible by the sample rate (and provide helpers to construct these spans), or force them to pass a rounding mode.

jrevels commented 2 years ago

I see it as three issues that came up here:

sounds like I've retitled this to issue 3, the other two can be filed separately?

The implementation of those semantics isn't correct when the right endpoint of the span doesn't fall exactly on a sample; we always throw away the right-endpoint even when it is well within the span. @kleinschmidt's suggestion

I think I'd need to consider this (or at least review associated PR a bit more deeply), haven't gotten a chance to thoroughly dig into it but i suspect there's a possibility that fixing the perceived bug might change the intended semantics (i.e. break existing tests). maybe not though and it really is just a simple bug that needs fixing

jrevels commented 2 years ago

I think there's too much magic right now: index_from_time is used internally by indexing into Onda.Samples and other places, and it's not clear on that level what semantics you are getting when you do samples[:, span]. (The Onda docs point to the tour, which doesn't discuss rounding). This indexing is convenient until something breaks and it turns out you would've been far better off carefully thinking about rounding semantics when you wrote the code originally.

I think this mischaracterizes what the package is intended to provide...it shouldn't result in the class of bugs one would normally refer to as "magical" - it's not trying to guess a rounding mode for you and then guessing wrong. It's intended to only provide a single implementation that has one particular "rounding mode" baked in.

I think this is really the docs issue you brought up, because it seems like people ARE expecting TimeSpans to guess a rounding mode for their use case, which I agree is a bad expectation and is not a feasible thing to try to implement

I also agree that providing user-selectable rounding modes is a fine idea, but I'm not sure I want to put further development into TimeSpans.jl when the intention from the inception of the package was supposed to be to deprecate it in favor of Intervals (it was originally split out of Onda at the time to facilitate decoupling the interface). Though maybe the index <-> time conversion parts should live in their own package to compose with Intervals.jl. either way

it turns out you would've been far better off carefully thinking about rounding semantics when you wrote the code originally.

this is definitely true and should be true for anybody using TimeSpans today as-is, regardless of future changes

ericphanson commented 2 years ago

sounds like I've retitled this to issue 3, the other two can be filed separately?

Sounds good, will do.

I think this mischaracterizes what the package is intended to provide...it shouldn't result in the class of bugs one would normally refer to as "magical" - it's not trying to guess a rounding mode for you and then guessing wrong. It's intended to only provide a single implementation that has one particular "rounding mode" baked in.

Hm, the thing I'm calling "too magical" is samples[:, TimeSpan(Minute(1), Minute(10))] (not index_from_time directly). I think index_from_time (at least with a period, not a span) is pretty clearly documented. But I think having samples[:, span] has led to overly casual use (by me! but also by everyone AFAIK) where the rounding implications are only thought about later when something goes wrong.

I also agree that providing user-selectable rounding modes is a fine idea, but I'm not sure I want to put further development into TimeSpans.jl when the intention from the inception of the package was supposed to be to deprecate it in favor of Intervals (it was originally split out of Onda at the time to facilitate decoupling the interface). Though maybe the index <-> time conversion parts should live in their own package to compose with Intervals.jl. either way

To me, it doesn't really matter what package we use; we'll have the same questions about "how to ergonomically grab subsets of samples corresponding to a given timespan, while handling index <-> time issues correctly for each use case". Agreed that index <-> time might end up being a separate package (or maybe part of Onda again?).

this is definitely true and should be true for anybody using TimeSpans today as-is, regardless of future changes

Yes totally, my point is I don't think that's not happening right now to the extent it needs to (and I am definitely as guilty of that as anyone else).

jrevels commented 1 year ago

I think the bug that spawned this issue has been fixed as of #51

julia> TimeSpans.index_from_time(1, TimeSpan(Millisecond(1500), Millisecond(2500)))
2:3