Closed Miking98 closed 3 months ago
I'd be down for a rename. I don't think it's necessary (and it would make writing code against the MEDS standard take more time), but if people think it's confusing, Event.datetime is fine.
We should decide on the default representational assumption of Event -- is it an interval or a point? There are tradeoffs of course
- Events are intervals, requiring Event.start_datetime and Event.end_datetime. When an event is a point, Event.end_datetime is null
- Events are points, requiring only Event.start_datetime. When an event is an interval, we materialize as 2 separate point events.
Option 1 places more burden on the ETL side, since end_datetime may require making assumptions on assignment. Option 2 seems more aligned with modern methods of linearizing/tokenizing data to discrete sequences.
I guess I favor option 2, since we're leaning into representations to feed into a model vs. representations that are more amenable to operations like writing labeling functions or other queries over the data (where interval reasoning might be super useful).
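To make the two options concrete, here is a minimal Python sketch. The class names, the `code` field, and the `/START` and `/END` code suffixes are hypothetical illustrations and not part of the MEDS schema; only `start_datetime` and `end_datetime` come from the discussion.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class IntervalEvent:
    """Option 1: one object per event; end_datetime is None for a point event."""
    code: str
    start_datetime: datetime
    end_datetime: Optional[datetime] = None


@dataclass
class PointEvent:
    """Option 2: every event carries a single timestamp."""
    code: str
    start_datetime: datetime


def to_point_events(event: IntervalEvent) -> List[PointEvent]:
    """Materialize an interval as two point events (hypothetical suffix convention)."""
    if event.end_datetime is None:
        # Already a point event: nothing to split.
        return [PointEvent(event.code, event.start_datetime)]
    return [
        PointEvent(f"{event.code}/START", event.start_datetime),
        PointEvent(f"{event.code}/END", event.end_datetime),
    ]
```

Under option 1 the ETL has to decide what `end_datetime` is for every event; under option 2 that decision turns into a choice about how start and end markers are encoded in the sequence.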
A challenge with option 2 is how do you map between the start and end of a particular event if they're now discrete objects separated by (potentially) 100s of other events (maybe even with the same code)?
I.e., you'd need to keep some "event UUID" in the metadata of both the start and end objects of an event (versus if they're part of the same object, then there's no ambiguity), which means that the user would be responsible for generating and tracking UUIDs across events during the ETL, and adjusting those accordingly any time they added/removed events.
On Mar 26, 2024 at 3:43 PM -0700, Jason Alan Fries @.***> wrote:
> We should decide on the default representational assumption of Event -- is it an interval or a point? There are tradeoffs of course
> - Events are intervals, requiring Event.start_datetime and Event.end_datetime. When an event is a point, Event.end_datetime is null
> - Events are points, requiring only Event.start_datetime. When an event is an interval, we materialize as 2 separate point events.
> Option 1 places more burden on the ETL side, since end_datetime may require making assumptions on assignment. Option 2 seems more aligned with modern tokenization-style methods of linearizing data to discrete sequences. I guess I favor option 2, since we're leaning into representations to feed into a model vs. representations that are more amenable to operations like writing labeling functions or other queries over the data (where interval reasoning might be super useful).
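The event-UUID bookkeeping raised in this comment could look roughly like the sketch below. Everything here is an assumption for illustration: events are plain dicts, and the `time`, `metadata`, `event_id`, and `boundary` keys are hypothetical names, not MEDS fields.

```python
import uuid
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple

Event = Dict[str, Any]


def split_interval(code: str, start: datetime, end: datetime) -> List[Event]:
    """Emit start/end point events that share one generated event_id."""
    event_id = str(uuid.uuid4())
    return [
        {"code": code, "time": start, "metadata": {"event_id": event_id, "boundary": "start"}},
        {"code": code, "time": end, "metadata": {"event_id": event_id, "boundary": "end"}},
    ]


def pair_intervals(
    events: List[Event],
) -> Dict[str, Tuple[Optional[datetime], Optional[datetime]]]:
    """Recover (start, end) per event_id from a flat list of point events."""
    bounds: Dict[str, Dict[str, datetime]] = {}
    for event in events:
        meta = event.get("metadata") or {}
        event_id = meta.get("event_id")
        if event_id is None:
            continue  # ordinary point event, nothing to pair
        bounds.setdefault(event_id, {})[meta["boundary"]] = event["time"]
    return {eid: (b.get("start"), b.get("end")) for eid, b in bounds.items()}
```

Without the shared `event_id`, two overlapping intervals with the same code could not be told apart once their boundaries are interleaved, which is the ambiguity described above. A missing start or end shows up as `None` in the recovered pair.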
Having separate start and end events would cause issues. I think the best approach is the current one, where we have optional end attributes in the metadata.
I think it depends on how we envision MEDS data being used. Do we actually need to track specific events with a UUID or is it enough to provide an indicator token in the sequence that some end event has occurred? I don't know. It's worth considering what scenarios actually require interval events, but I concede it introduces some complexity at unknown benefit.
Since "tokenization" (i.e., the process of transforming partially ordered clinical events into a sequence) is defined higher in the MEDS transformation stack, I'm fine with using an interval event. However, I sort of hate having to always check metadata to see if an end event is defined, so I'd prefer explicit Event.start_datetime and Event.end_datetime. I don't have a sense for how much this blows up the schema size though.
> However, I sort of hate having to always check metadata to see if an end event is defined
You don't need to do that. For 99% of use cases, you only need the start. End events are very special and application-specific. The two main examples I am aware of (visits and prescriptions) have very different semantics and should generally not be handled by the same code anyway.
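As a rough illustration of that point, application-specific code can read an optional end from the metadata only where it needs one. This is a sketch under assumptions: events are plain dicts, and the `time`, `metadata`, and `end` keys are hypothetical stand-ins rather than the actual MEDS field names.

```python
from datetime import datetime
from typing import Any, Dict, Optional

Event = Dict[str, Any]


def get_end(event: Event) -> Optional[datetime]:
    """Return the optional end stored in metadata, or None if absent."""
    return (event.get("metadata") or {}).get("end")


def visit_length_days(event: Event) -> Optional[float]:
    """Visit-specific logic: length in days, only when an end is recorded."""
    end = get_end(event)
    if end is None:
        return None
    return (end - event["time"]).total_seconds() / 86400
```

Code that only cares about when something started never calls `get_end`; a prescription handler would presumably interpret the same field with different semantics.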
Suggestion to change Event.time => Event.datetime.
- time is actually a datetime, so we should call it what it is to avoid confusion
- time is confusing -- it could be a datetime (e.g. 3/2/2024 @ 11:49pm) or just the time (e.g. 11:49pm).
- Event.time and Measurement.datetime_value. Both are of type datetime, so we should call them the same thing to avoid confusion.
Other suggestion: instead of just Event.datetime, can we put Event.start_datetime and Event.end_datetime into the actual schema for Event (with end_datetime Optional), rather than relegating end to some arbitrary field in the metadata?
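On the naming point, Python's standard library already separates the two meanings of "time", which may be part of why a field called time reads ambiguously. A small illustration using only the standard `datetime` module (the variable names are arbitrary):

```python
from datetime import datetime, time

stamp = datetime(2024, 3, 2, 23, 49)  # a full date plus time of day
clock = stamp.time()                  # the time of day only

# The two are distinct types: a datetime is not a time, and vice versa.
assert isinstance(stamp, datetime) and not isinstance(stamp, time)
assert isinstance(clock, time) and not isinstance(clock, datetime)
```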