beacon-biosignals / Onda.jl

A Julia package for high-throughput manipulation of structured signal data across arbitrary domain-specific encodings, file formats and storage layers

define a standardized, serializable representation for modality-agnostic "montage descriptions" #114

Open jrevels opened 2 years ago

jrevels commented 2 years ago

We've talked about this for years internally at Beacon, but have never actually gotten around to implementing it.

At Beacon, we have libraries, applications, and distributed services that all perform certain kinds of re-referencing/montaging operations. It's thus pretty common for one actor to want to describe a montage to be fulfilled/resolved/computed by another actor, and for actors to reuse each other's montage descriptions for consistency purposes. Examples:

For the sake of interoperability between such actors (especially across language environments and service/package API boundaries), it'd be useful/appropriate for Onda to define a generic specification for a serializable representation of a "montage description". It probably should be defined in JSON (which is encodable in Arrow) to enable portability/applicability across many language environments.

It would need to support common arithmetic broadcast operations as well as reduction operations. We should clearly define what "resolvers" need to support. We probably want to also ensure the standard representation is extensible with user-defined operations to support modality-specific operations (e.g., filtering).

jaypath commented 2 years ago

Agree... it is not reasonable to build a "library" of montages. Sometimes custom montages are required (a common example is when a lead switch is known to have occurred, and then the montage would need to substitute the leads as appropriate).

Perhaps a montage description/definition in which a set of reference channels is defined (system ref, avg ref, A1, A2, etc.) and then channel defs as input#-ref#. For example: define the references to be ref (the recording reference), avgref (the average of all EEG channels), A2 (right ear), and [Fp2 + F7 + f3] + 0.5 * [t3 + c3 + f8] as localLaplacianAverage_fp1. Then I want a montage that has Fp1-ref, Fp1-avgref, Fp1-A2, and Fp1-localLaplacianAverage_fp1.

jrevels commented 2 years ago

Got a start on this while on the plane from SFO to JFK yesterday!

Intro

First, let's define a "montage" in Onda-land to be some specification of a transformation of the form:

(input_multichannel_sample_data_1, input_multichannel_sample_data_2, ...) -> (output_multichannel_sample_data_1, output_multichannel_sample_data_2, ...)

We want Onda montages to be primarily concerned with operations over collections of channels, including realiasing, pattern matching, grouping/stacking/splitting, and numerical broadcast/reduction operations. They may be notably unconcerned with:

If we canonicalize a (de)serializable DSL representation for Onda montages, then Onda.jl (and others) can implement resolver functions like:

# executes the transformation specified by `montage` given the input data
execute(montage,
        input_multichannel_sample_data_1,
        input_multichannel_sample_data_2,
        ...)

This would basically require implementing an interpreter for the montage DSL. If we implemented an abstract interpreter (or made sure our interpreter could be run in an abstract interpretation mode), we could use the exact same DSL to do "data-less" transformation/matching, e.g.

# returns the derived output channel names
evaluate(montage,
         input_channel_names_1,
         input_channel_names_2,
         ...)

(Note that in this approach, like with regex, an empty output channel set would mean that the input didn't "match" the montage.)
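To make the "data-less" mode concrete, here's a rough sketch (Python for illustration only; the helper name, node shapes, and pass-through semantics are all invented, not part of any spec) of interpreting a tiny montage AST over channel *names* alone:

```python
# Hypothetical sketch: interpreting a montage AST over channel *names*
# only (abstract mode), so derived output names can be computed without
# any sample data. An empty result means "the input didn't match".
def evaluate_names(node, inputs):
    if isinstance(node, str):            # "$1", "$2", ... -> input channel names
        return inputs.get(node, [])
    op, args = next(iter(node.items()))  # single-element object = AST node
    if op == "get":                      # keep only names matched by the pattern
        source, pattern = args
        wanted = set(pattern["any"])
        return [n for n in evaluate_names(source, inputs) if n in wanted]
    if op == "-":                        # broadcast op: names pass through the left side
        return evaluate_names(args[0], inputs)
    raise ValueError(f"unknown op: {op}")

inputs = {"$1": ["a", "x", "b"]}
ast = {"-": ["$1", {"get": ["$1", {"any": ["a", "b", "c"]}]}]}
print(evaluate_names(ast, inputs))       # ['a', 'x', 'b']

no_match = {"get": ["$2", {"any": ["a"]}]}
print(evaluate_names(no_match, inputs))  # [] -> input didn't match
```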

It's worth calling out that Julia is a pretty nice language for implementing these kinds of interpreters - you basically just parse content into AST nodes represented via Julia types, then you can use multiple dispatch to your heart's content to actually drive the interpretation.

Simple JSON-based DSL

Here's an extremely rough sketch of a DSL for representing montages staged atop JSON, with AST nodes represented by single-element objects:

Examples

Return all channels of the input minus the average of channels a, b, and c:

[
    {"$channels":{"get":["$1",{"any":["a","b","c"]}]}},
    {"$avg":{"mean":"$channels"}},
    [{"-":["$1","$avg"]}]
]
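Concretely, under one plausible reading of those semantics (hypothetical Python, single-sample channels for brevity, and assuming "-" broadcasts a scalar across every channel of $1), a resolver would evaluate this program as:

```python
# Hypothetical step-by-step evaluation of the montage above, with
# single-sample channels standing in for real multichannel data.
input_1 = {"a": 1.0, "b": 2.0, "c": 3.0, "z": 10.0}

# {"$channels": {"get": ["$1", {"any": ["a","b","c"]}]}}
channels = {k: v for k, v in input_1.items() if k in {"a", "b", "c"}}

# {"$avg": {"mean": "$channels"}}
avg = sum(channels.values()) / len(channels)  # 2.0

# [{"-": ["$1", "$avg"]}]  -- the final array is the returned output
output = {k: v - avg for k, v in input_1.items()}
print(output)  # {'a': -1.0, 'b': 0.0, 'c': 1.0, 'z': 8.0}
```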

Return the dynamic range of the first signal minus the dynamic range of second, stacked on the magnitude of the difference of the two signals' means:

[
    {"$range1":{"-":[{"max":"$1"},{"min":"$1"}]}},
    {"$range2":{"-":[{"max":"$2"},{"min":"$2"}]}},
    {"$diff1":{"-":["$range1","$range2"]}},
    {"$diff2":{"abs":{"-":[{"mean":"$1"},{"mean":"$2"}]}}}
    {"return": [{"stack":["$diff1","$diff2"]}]}
]

Restructure three signals into two signals:

[
    {"$x": [
        {"alias":["x1", {"get":["$1", "a"]}]},
        {"alias":["x2", {"get":["$2", "a"]}]},
        {"alias":["x3", {"get":["$3", "a"]}]},
    ]},
    {"$y": [
        {"alias":["y1", {"get":["$1", "b"]}]},
        {"alias":["y2", {"get":["$2", "b"]}]},
        {"alias":["y3", {"get":["$3", "b"]}]},
    ]},
    {"return": ["$x", "$y"]}
]

Same DSL, Julia-Flavored Syntax

We can stage the same language upon (a very limited subset of) Julia syntax instead of JSON.

Pros:

Cons:

Below are the same examples written in the JSON DSL, written in Julia-Flavored syntax instead (uses IN[i] instead of $i, => for alias, etc.):

Return all channels of the input minus the average of channels a, b, and c:

channels = IN[1][any("a", "b", "c")];
avg = mean(channels);
return [IN[1] - avg];

Return the dynamic range of the first signal minus the dynamic range of second, stacked on the magnitude of the difference of the two signals' means:

range1 = max(IN[1]) - min(IN[1]);
range2 = max(IN[2]) - min(IN[2]);
diff1 = range1 - range2;
diff2 = abs(mean(IN[1]) - mean(IN[2]));
return [stack(diff1, diff2)];

Restructure three signals into two signals:

x = ["x1" => IN[1]["a"],
     "x2" => IN[2]["a"],
     "x3" => IN[3]["a"]];
y = ["y1" => IN[1]["b"],
     "y2" => IN[2]["b"],
     "y3" => IN[3]["b"]];
return [x, y];

Channel Name Propagation/Generation

Our DSL enables authors to realias channel names whenever they want, but what if they don't? What are the names of the output channels? To answer this question, we should define semantics for tracing/accumulating/propagating intermediate channel names at each operation of our DSL.

Take our earlier example:

[
    {"$channels":{"get":["$1",{"any":["a","b","c"]}]}},
    {"$avg":{"mean":"$channels"}},
    [{"-":["$1","$avg"]}]
]

What should the output channel names be given input channels ["a", "x", "b", "y", "c", "z"]?

IMO it'd make sense for the semantics to follow the bindings used by the author, in which case we might have to add some notion to drive when we want bindings to inline into channel names, or maybe those semantics can be hardcoded into our statement/broadcast/reduction operations.

You could imagine defining these semantics so that it's obvious that the answer to the above is:

["a-avg", "x-avg", "b-avg", "y-avg", "c-avg", "z-avg"]

This also implies that we might need to change Onda's validation criteria for onda.signal@1 channels to allow for much richer structure.

Another thing to consider is cross-referencing across signals, like:

[
    [{"-":["$1","$2"]}]
]

From the Onda specification:

Furthermore, to allow arbitrary cross-signal referencing, a and/or b may be channel names from other signals contained in the recording. If this is the case, such a name must be qualified in the format signal_name.channel_name. For example, an eog signal might have a channel named left-eeg.m1 (the left eye electrode referenced to the mastoid electrode from a 10-20 EEG signal).

This would require some thought to get right / avoid too much magic

jrevels commented 2 years ago

I should point out that one of my prerequisite action items for the above design brainstorm was to do a bit of review and see how other systems handle this (e.g. for EEG remontaging/specification - I know lots of other systems have some form of custom montage specification). However, my plane's WiFi was busted so I couldn't search for anything, so I dove ahead anyway 😁

Would still be good to do a bit of review here, though.

jaypath commented 2 years ago

My take on montages is SUPER simple. Wrote it out here, hopefully you can read my writing. I can define a transform matrix for all the montages I use... though you have possibly suggested use cases where I could not (though I'm not sure about that).

Happy to meet... about to set up a discussion with Dave and Mike. 20220214101042522.pdf

jrevels commented 2 years ago

dope! thanks, that makes sense

i think the "essence" of the DSL-based approach is to enable doing these transforms in a "basis agnostic" way from a linear algebra perspective, since in Onda the basis is data-driven (by sets of channels) not fixed by convention. hence the functional form vs. the matrix form

might be convenient to include a matrix representation as well though

i guess also you could establish some convention like "i don't know what the basis is, but I do always sort the channel names lexicographically," in which case you could represent canonical transforms as slices of the same canonical (infinite dimensional but countable) matrix

That makes me realize we could also separate this out into two things:

  1. pattern-matching that extracts + orders a well-known channel set (i.e. explicitly establishes the basis for the montage transform)
  2. the linear transform for applying the montage

that actually might be cleaner but maybe less flexible? worth thinking about
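A minimal sketch of that two-phase split (hypothetical function names; plain Python lists standing in for sample matrices): first fix the basis by extracting an ordered channel set, then apply a fixed linear transform over that basis.

```python
# Hypothetical two-phase montage: (1) pattern-match to establish an
# explicit, ordered basis; (2) apply a fixed matrix over that basis.
def extract_basis(channels, basis_order):
    """Phase 1: select/reorder channels; None means the input didn't match."""
    try:
        return [channels[name] for name in basis_order]
    except KeyError:
        return None

def apply_montage(matrix, basis_vals):
    """Phase 2: plain matrix-vector product (single sample per channel)."""
    return [sum(w * v for w, v in zip(row, basis_vals)) for row in matrix]

channels = {"c3": 2.0, "c4": 4.0, "ref": 1.0}
basis = extract_basis(channels, ["c3", "c4", "ref"])
# rows compute "c3-ref" and "c4-ref" for a referential montage
montage = [[1.0, 0.0, -1.0],
           [0.0, 1.0, -1.0]]
print(apply_montage(montage, basis))  # [1.0, 3.0]
```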

SimonDanisch commented 2 years ago

The matrix form seems easy to implement efficiently, which would be a great pro.

But it doesn't seem to cover the entire scope of what we want from our transformation system (e.g. human readability, filtering). I have to admit, I'm also a bit biased against a system that forces you to express every transform as a matrix. From my time with Matlab, I remember that it forces you to spend considerable time figuring out the exact structure of the matrix, while something like all_channels .- mean(all_channels) is quite easy to comprehend and come up with. As someone who hasn't done matrix math in ages, the avg transform example definitely took me quite some time to understand. This kind of cognitive work would be needed every time one tries to come up with new transforms, or every time one reads an existing transform not written by oneself. Maybe we can extend the hybrid @jrevels already proposed and treat the matrix form mainly as an optimization for our implementation?
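For the average-reference case, the two forms do compute the same thing, which supports treating the matrix as an implementation detail. A quick hypothetical check (Python with single-sample channels standing in for real data):

```python
# The broadcast form `all_channels .- mean(all_channels)` vs. the
# matrix form (I - ones/n) applied to the same channels: a check
# that the two agree for an average-reference montage.
vals = [1.0, 2.0, 3.0, 6.0]
n = len(vals)

# broadcast form: subtract the mean from every channel
m = sum(vals) / n
broadcast_out = [v - m for v in vals]

# matrix form: rows of (I - (1/n) * ones(n, n))
matrix = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)]
          for i in range(n)]
matrix_out = [sum(w * v for w, v in zip(row, vals)) for row in matrix]

print(broadcast_out)                  # [-2.0, -1.0, 0.0, 3.0]
print(matrix_out == broadcast_out)    # True
```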

For the JSON vs Julia debate: the Julia syntax does look much nicer and more tempting now that I look at it again. I was against it at first, but considering that Julia is quite easy to parse, working with the AST shouldn't be harder than working with a nested JSON dict. I'd try very hard, though, to make it behave pretty much the same when run as actual code, so that confusion for people who know Julia stays minimal.

I want to bring up two more general concerns:

Performance

It wasn't easy to optimize streaming montage performance. Small slip-ups like allocating a single temporary array were already noticeable in the latency (especially for multiple users), so to be usable for streaming we mostly need the performance of handwritten code. Doing this for even more general transforms won't be easy (although we can reuse things like the memory pool).

So, we should make sure that the specification won't make us invest months into implementing a difficult mapreduce transformation system. Would be interesting to see if we can use some existing packages to create optimal transformations (Taco, Dagger, some of the Einstein sum packages?).

Security

I think we do need to accept pretty arbitrary specifications from the web. We should be able to restrict that to authorized users, though, which should decrease the attack surface quite a bit. But when choosing the parser and the implementation, we should definitely keep an eye out for security concerns.

jrevels commented 2 years ago

Would be interesting to see if we can use some existing packages to create optimal transformations (Taco, Dagger, some of the Einstein sum packages?).

Yeah, that's a good idea. Worth calling out that "optimal" will mean different things to different users/use cases (e.g. latency vs. throughput). I think once we settle on / publish the spec, we should include a very basic montage resolver in Onda and maybe publish a more-dependency-heavy-but-more-performant resolver (backed by e.g. Dagger) in a separate package

I think we do need to accept pretty arbitrary specifications from the web. Although, we should be able to restrict it to authorized users, which should decrease the attack surface quite a bit. But, when chosing the parser and the implementation, we should definitely keep an eye out for security concerns.

I'm not sure about the relevance of this πŸ€” Do you mean a situation where an attacker (or always more likely, a well-intentioned caller making some mistake 😁) accidentally sent some huge transformation description to some resolver function/service that would overload its capacity? i'd imagine that's pretty easily mitigated by the resolver just having some basic statement/complexity limit or something
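A basic resolver-side guard of the kind suggested here might just cap AST size before interpreting. A hypothetical sketch (function names and the budget value are invented):

```python
# Hypothetical mitigation: count AST nodes in a parsed JSON montage and
# reject programs over a fixed complexity budget before interpreting.
def count_nodes(node):
    if isinstance(node, dict):
        return 1 + sum(count_nodes(v) for v in node.values())
    if isinstance(node, list):
        return 1 + sum(count_nodes(v) for v in node)
    return 1  # leaf: string/number

def check_budget(montage, max_nodes=1_000):
    n = count_nodes(montage)
    if n > max_nodes:
        raise ValueError(f"montage too complex: {n} nodes > {max_nodes}")
    return n

montage = [{"$avg": {"mean": "$channels"}}, [{"-": ["$1", "$avg"]}]]
print(check_budget(montage))  # 9
```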

(keep in mind this thread is public / in OSS land, so there might not be as much context for internal Beacon projects πŸ™‚)

SimonDanisch commented 2 years ago

I'm not sure about the relevance of this

Yes, me neither! DSLs and parsing always ring some alarm bells for me, so I just wanted to quickly bring it up, in case there are some non obvious implications :)