[Proposal] Usability: Change `Event.time` => `Event.datetime`

Medical-Event-Data-Standard / meds

Schema definitions and Python types for Medical Event Data Standard, a standard for medical event data such as EHR and claims data

Apache License 2.0

38 stars 3 forks source link

[Proposal] Usability: Change `Event.time` => `Event.datetime` #16

Closed Miking98 closed 3 months ago

Miking98 commented 8 months ago

Suggestion to change Event.time => Event.datetime.

time is actually a datetime, so we should call it what it is to avoid confusion
time is confusing -- it could be a datetime (i.e. 3/2/2024 @ 11:49pm) or just the time (i.e. 11:49pm).
Currently, we have Event.time and Measurement.datetime_value. Both are of type datetime, so we should call them the same thing to avoid confusion.

Other suggestion: instead of just Event.datetime, can we put Event.start_datetime and Event.end_datetime into the actual schema for Event (with end_datetime Optional), rather than relegating end to some arbitrary field in the metadata?

A lot of labeling functions will depend on having an end_datetime, so it seems a bit confusing to hide this in metadata.

EthanSteinberg commented 7 months ago

I'd be down to a rename. I don't think it's necessary (and it would make writing code against the meds standard take more time), but if people think it's confusing, Event.datetime is fine.

jason-fries commented 7 months ago

We should decide on the default representational assumption of Event -- is it an interval or a point? There are tradeoffs of course

Events are intervals, requiring Event.start_datetime and Event.end_datetime. When an event is a point, Event.end_datetime is null
Events are points, requiring only Event.start_datetime. When an event is an interval, we materialize as 2 separate point events.

Option 1 places more burden on the ETL side, since end_datetime may require making assumptions on assignment. Option 2 seems more aligned with modern methods of linearizing/tokenization data to discrete sequences.

I guess I favor option 2, since we're learning into representations to feed into a model vs. representations that are more amendable to operations like writing labeling functions or other queries over the data (where interval reasoning might be super useful).

Miking98 commented 7 months ago

A challenge with option 2 is how do you map between the start and end of a particular event if they’re now discrete objects separated by (potentially) 100s of other events (maybe even with the same code)?

Ie you’d need to keep some “event UUID” in the metadata of both the start and end objects of an event (versus if they’re part of the same object, then there’s no ambiguity).

which means that the user would be responsible for generating and tracking UUIDs across events during the ETL, and any time they added/removed events adjusting those accordingly. On Mar 26, 2024 at 3:43 PM -0700, Jason Alan Fries @.***>, wrote:

We should decide on the default representational assumption of Event -- is it an interval or a point? There are tradeoffs of course

Events are intervals, requiring Event.start_datetime and Event.end_datetime. When an event is a point, Event.end_datetime is null

Events are points, requiring only Event.start_datetime. When an event is an interval, we materialize as 2 separate point events.

Option 1 places more burden on the ETL side, since end_datetime may require making assumptions on assignment. Option 2 seems more aligned with modern tokenization-style methods of linearizing data to discrete sequences. I guess I favor option 2, since we're learning into representations to feed into a model vs. representations that are more amendable to operations like writing labeling functions or other queries over the data (where interval reasoning might be super useful). — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

EthanSteinberg commented 7 months ago

Having separate start and end events would cause issues. I think the best approach is the current one, where we have optional end attributes in the metadata.

jason-fries commented 7 months ago

I think it depends on how we envision MEDS data being used. Do we actually need to track specific events with a UUID or is it enough to provide an indicator token in the sequence that some end event has occurred? I don't know. It's worth considering what scenarios actually require interval events, but I concede it introduces some complexity at unknown benefit.

Since "tokenization" (i.e., the process of transforming partially ordered clinical events into a sequence) is defined higher in the MEDS transformation stack, I'm fine with using an interval event. However, I sort of hate having to always check metadata to see if an end event is defined, so I'd prefer explicit Event.start_datetime and Event.end_datetime. I don't have a sense for how much this blows up the schema size though.

EthanSteinberg commented 7 months ago

However, I sort of hate having to always check metadata to see if an end event is defined

You don't need to do that. For 99% of usecases, you only need the start. End events are very special and are application specific. The two main examples I am aware of (visits and prescriptions) have very different semantics and should generally not be handled by the same code anyways.