beacon-biosignals / Onda.jl

A Julia package for high-throughput manipulation of structured signal data across arbitrary domain-specific encodings, file formats and storage layers
Other
67 stars 5 forks source link

Signals with a `start` time behavior strangely #101

Closed ericphanson closed 2 years ago

ericphanson commented 3 years ago

Discussed with @kimlaberinto and @hannahilea and we are kind of confused:

julia> using OndaDSP, Onda, Dates, UUIDs # OndaDSP is not open source yet, sorry onlookers

julia> samples = sine_wave(45Hz)
Samples (00:10:00.000000000):
  info.kind: "synthetic"
  info.channels: ["c1"]
  info.sample_unit: "microvolt"
  info.sample_resolution_in_unit: 0.25
  info.sample_offset_in_unit: 0
  sample_type(info): Int16
  info.sample_rate: 200 Hz
  encoded: false
  data:
1×120000 Matrix{Float64}:
 0.0  0.98769  0.308995  -0.891023  …  -0.308995  -0.98769  -5.53307e-12

julia> signal = store("test.lpcm.zst", "lpcm.zst", samples, uuid4(), Second(1))
Row(Schema("onda.signal@1"), (recording = UUID("fddda2b2-2e3e-401d-8d6a-8c90021924f1"), file_path = "test.lpcm.zst", file_format = "lpcm.zst", span = TimeSpan(00:00:01.000000000, 00:10:01.000000000), kind = "synthetic", channels = ["c1"], sample_unit = "microvolt", sample_resolution_in_unit = 0.25, sample_offset_in_unit = 0, sample_type = "int16", sample_rate = 200))

julia> load(signal)
Samples (00:10:00.000000000):
  info.kind: "synthetic"
  info.channels: ["c1"]
  info.sample_unit: "microvolt"
  info.sample_resolution_in_unit: 0.25
  info.sample_offset_in_unit: 0
  sample_type(info): Int16
  info.sample_rate: 200 Hz
  encoded: false
  data:
1×120000 Matrix{Float64}:
 0.0  1.0  0.25  -1.0  -0.5  0.75  …  -0.75  0.5  1.0  -0.25  -1.0  0.0

julia> load(signal, signal.span) # loses 1 second, no error
Samples (00:09:59.000000000):
  info.kind: "synthetic"
  info.channels: ["c1"]
  info.sample_unit: "microvolt"
  info.sample_resolution_in_unit: 0.25
  info.sample_offset_in_unit: 0
  sample_type(info): Int16
  info.sample_rate: 200 Hz
  encoded: false
  data:
1×119800 Matrix{Float64}:
 0.0  1.0  0.25  -1.0  -0.5  0.75  …  -0.75  0.5  1.0  -0.25  -1.0  0.0

julia> load(signal)[:, signal.span] # error
ERROR: BoundsError: attempt to access 1×120000 Matrix{Float64} at index [1:1, 201:120200]
Stacktrace:
 [1] throw_boundserror(A::Matrix{Float64}, I::Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}})
   @ Base ./abstractarray.jl:651
 [2] checkbounds
   @ ./abstractarray.jl:616 [inlined]
 [3] _getindex(::IndexLinear, ::Matrix{Float64}, ::Base.Slice{Base.OneTo{Int64}}, ::UnitRange{Int64})
   @ Base ./multidimensional.jl:831
 [4] getindex
   @ ./abstractarray.jl:1170 [inlined]
 [5] getindex(samples::Samples{Matrix{Float64}, Legolas.Row{Legolas.Schema{Symbol("onda.samples-info"), 1}, NamedTuple{(:kind, :channels, :sample_unit, :sample_resolution_in_unit, :sample_offset_in_unit, :sample_type, :sample_rate), Tuple{String, Vector{String}, String, Float64, Int64, String, Int64}}}}, rows::Function, columns::TimeSpans.TimeSpan)
   @ Onda ~/.julia/packages/Onda/iL8pS/src/samples.jl:137
 [6] top-level scope
   @ REPL[28]:1

julia> signal.span
TimeSpan(00:00:01.000000000, 00:10:01.000000000)

Is this how it is supposed to work? It doesn't seem right.

jrevels commented 3 years ago

just to recall from the docs

    load(signal[, span]; encoded::Bool=false)

If span is present, return load(...)[:, span]

julia> load(signal)[:, signal.span] # error

you're indexing relative to the Samples object itself, so this expression should always induce an error if start(signal.span) > Nanosecond(0). in other words, I think you might've wanted TimeSpan(0, duration(signal.span)) instead of signal.span

julia> load(signal, signal.span) # loses 1 second, no error Samples (00:09:59.000000000):

This should probably throw an error. not able to check right now but I bet the implementation is grabbing whatever data is available up until the stop of the span.

IIRC the underlying API method used here does this explicitly/purposefully, since it tries to follow the familiar Base.read(::IOStream, nb::Integer) behavior (e.g. "read at most nb bytes")

ericphanson commented 3 years ago

Ok, I think if the load errored the semantics would be more clear.

It seems like what a timespan is relative to changes a lot:

  1. when creating a Signal, start is relative to the start timestamp of the associated Recording (and signal.span is then also relative to the recording)
  2. when loading a Signal, the span needs to be relative to the start(signal.span)
  3. when indexing a Samples, its relative to start(span) for whatever span was loaded.

I think the discrepancy between (1) and (2) is what confused me here, but also (2) vs (3) can be confusing as well.

(1) vs (2) comes up if you are trying to use the same “wall clock” timespan for two signals from the same recording; you need to do a manual shift to get the same timespan.

(2) vs (3) I’ve basically got a handle on at this point (it’s a bit easier to understand I think since Samples don’t have any concept of a start position, so it has to work like that), but I think it’s still confusing when onboarding onto Onda and often requires manual shifts.

jrevels commented 3 years ago

it’s a bit easier to understand I think since Samples don’t have any concept of a start position, so it has to work like that

yup, this is the concept to remember. Samples instances are "context agnostic."

Onda actually started out the other way (Samples were coupled with their origin context), and IMO things got much nicer when we changed to the current approach

when loading a Signal, the span needs to be relative to the start(signal.span)

The easier mental path might be: "Samples are context agnostic, so getindex(x::Samples, :, ::TimeSpan) is of course relative to x itself, and then load(signal, span) is just an optimized fast path alternative to load(signal)[:, span] in certain situations."

(given that it's supposed to be an easily swappable fast-path alternative, it'd feel weird IMO for span behave differently in the two different expressions of the same operation)

but I think it’s still confusing when onboarding onto Onda and often requires manual shifts.

If helpful, we can add more content to examples/tour.jl if it's not already covered there

jrevels commented 3 years ago

(1) vs (2) comes up if you are trying to use the same “wall clock” timespan for two signals from the same recording; you need to do a manual shift to get the same timespan.

At least this is explicitly covered in the examples:

https://github.com/beacon-biosignals/Onda.jl/blob/bddf32faa18005ecf2b0ac23d2f16f1e8cd16dff/examples/tour.jl#L158-L162

ericphanson commented 3 years ago

Onda actually started out the other way (Samples were coupled with their origin context), and IMO things got much nicer when we changed to the current approach

Ok, good to know that this was already tried!

I can see how the context-agnosticness can be useful for writing generic code.

The easier mental path might be: "Samples are context agnostic, so getindex(x::Samples, :, ::TimeSpan) is of course relative to x itself, and then load(signal, span) is just an optimized fast path alternative to load(signal)[:, span] in certain situations."

I think this makes sense; so instead of seeing it as 3 choices of relativeness, there's only 2 because load should be the same as indexing.

I think it is still kind of confusing that load(signal, signal.span) is different from load(signal). I get that follows from the "load == indexing" principle, but I still just find it a confusing API. It really just seems like if span is omitted, then a natural default is the signal's span, but it's actually TimeSpan(0, duration(signal.span)) instead.

Was it like this in Onda v0.11 too? There it seems even more confusing since you are passing a recording UUID to load: https://beacon-biosignals.github.io/Onda.jl/v0.11/#Onda.load, so I'd expect the span to be relative to the recording's start time, not the samples object being pulled. Maybe you couldn't have signals with different start times relative to the recording then?

jrevels commented 2 years ago

I think it is still kind of confusing that load(signal, signal.span) is different from load(signal). I get that follows from the "load == indexing" principle, but I still just find it a confusing API. It really just seems like if span is omitted, then a natural default is the signal's span, but it's actually TimeSpan(0, duration(signal.span)) instead.

Since the format allows signals to have their own spans relative to their recording, we will deal with "relativity" in the API no matter what, so we'll end up picking a consistent/well-documented convention and sticking with it regardless. Let's say we picked the other possible convention here - then I would make the claim that it feels confusing that load's span isn't the same as getindex's span, and that I'd expect it to not require offsetting to switch between the two equivalent segment loading paths (load(signal, span) vs. load(signal)[:, span]).

More philosophically, since load is operating at a per-signal level of the Samples API (it doesn't have any knowledge about recordings or cross-signal operations), it feels more consistent to me keep it recording-agnostic and "internal" to the provided signal. Let the caller do recording-aware/cross-signal transforms, it'd be weird to try to "do that for them" in this one case

One thing we technically could do to make our conventional choice more configurable to callers is to either provide a kwarg like relative=true or even a RelativeTimeSpan struct or something. But I feel like it's actually more desirable/readable/minimal for the caller to just do whatever transform they want to do on the span they provide

Was it like this in Onda v0.11 too? There it seems even more confusing since you are passing a recording UUID to load: https://beacon-biosignals.github.io/Onda.jl/v0.11/#Onda.load, so I'd expect the span to be relative to the recording's start time, not the samples object being pulled. Maybe you couldn't have signals with different start times relative to the recording then?

ngl, I can't recall 😁

ericphanson commented 2 years ago

Yeah, that makes sense. Agreed it makes sense for the caller to do it, I think the issue is just that they might not realize they need to do it.

Hm, as a small tweak, what if we renamed the positional argument from span to signal_subspan or something like that? To indicate it is a relative subspan of the signal, not a span within the recording.