Open ChaoPang opened 7 months ago
Thanks for bringing this up. Better handling of visits is important and this could / should be improved.
I think you have the right idea in terms of "Meds Extension". Basically, you want to define a second format "MEDS Visit" that is more organized around visit calculation, with transformations to that format from MEDS. I would recommend doing this transformation on the fly for models. So users pass in a MEDS dataset and your code would convert it to MEDS Visit as needed.
Note that we already have something similar called MEDS Flat, that makes writing ETLs easier (see https://github.com/Medical-Event-Data-Standard/meds_etl?tab=readme-ov-file#meds-flat).
For a first pass, I would call this format CEHRTPatient or something, with a name specific to your package. Once you get it working we could add it to this schema repository with the name MEDS Visit if you think it will be generally useful to other people.
For writing the transformation from MEDS to MEDS Visit, I would recommend using the visit_id tags that are retained in https://github.com/Medical-Event-Data-Standard/meds_etl/blob/main/src/meds_etl/omop.py#L277 and https://github.com/Medical-Event-Data-Standard/meds_etl/blob/main/src/meds_etl/mimic/__init__.py#L74
I'm not sure if those ETLs have everything you need to construct visit groupings. If you run into issues, feel free to make a Pull Request for the ETLs to add the information you need.
Now to answer your questions more explicitly
In my first experiment, I used event to store the visit information, where I put visit_start_datetime as event.time ... which may violate the meaning of an Event.
Yeah, this is invalid as measurements within an events (per the schema) must always have the same time.
Do you have any plans to introduce Visit to the schema?
I don't think so. If anything, we will probably make the core MEDS schema simpler (removing datetime_value, the double nesting of Event, etc).
The idea here is that we want a minimal schema with users providing additional structure (like MEDS Visit or MEDS Flat) as needed.
@ChaoPang One idea that is also worth thinking about:
It might make your life easier if we add a metadata field "is_visit_record" to mark which records are supposed to indicate the start/end of each visit.
That way you would know that "MIMIC_IV_Admission/Inpatient" is a visit record and that "Visit/IP" (from OMOP) is a visit record.
That type of metadata field would make it easier for your code to support both OMOP and MIMIC (and potentially other datasets).
@EthanSteinberg thanks a lot for the recommendation, that makes a lot of sense! I started looking at the meds_etl
and was able to run it on an OMOP Synthea dataset (I was curious to see what the output looks like).
It might make your life easier if we add a metadata field "is_visit_record" to mark which records are supposed to indicate the start/end of each visit.
To identify visits, I iterated through all measurements with table=visit
to retrieve all the visit records. It would be helpful to add is_visit_record
for sure, however, I am not sure it would make a huge difference since this could be computed on the fly. One thing I wanted to bring up is that I had to modify the meds_etl
locally to add discharge_facility
as an additional metadata field (https://github.com/ChaoPang/meds_etl/commit/81604e3be13a53d15a97d016bbf925a57a12d1ab), it would be great if this field could be added to metadata in the MEDs, please let me know what you think.
Below is the logic I used to convert the MEDS Patient
to CEHRTPatient
,
INPATIENT_VISIT_TYPES = [
'9201', '262', '8971', '8920', '38004311'
]
INPATIENT_VISIT_TYPE_CODES = [
'Visit/IP', 'Visit/ERIP', 'Visit/51', 'Visit/61', 'NUCC/315D00000X'
]
def meds_to_cehrbert_extension(meds_record: Patient) -> CehrBertPatient:
"""
Convert the MEDS Patient to the CerBertPatient extension, where the patient timeline is organized around visits
"""
patient_id = meds_record['patient_id']
static_measurements = meds_record['static_measurements']
events = meds_record['events']
assert len(events) >= 1
birth_datetime = events[0]['time']
race = None
gender = None
ethnicity = None
for measurement in events[0]['measurements']:
code = measurement['code']
if 'race' in code.lower():
race = code
elif 'gender' in code.lower():
gender = code
elif 'ethnicity' in code.lower():
ethnicity = code
visit_mapping = dict()
# Iterate through all measurements
for m, time in [(m, e['time']) for e in events for m in e['measurements']]:
if m['metadata']['table'] == 'visit':
visit_type = m['code']
is_inpatient = m['code'] in INPATIENT_VISIT_TYPE_CODES or m['code'] in INPATIENT_VISIT_TYPES
visit_end_datetime = (
datetime.datetime.strptime(m['metadata']['end'], DATE_FORMAT) if m['metadata']['end'] else None
)
discharge_facility = m['metadata']['discharge_facility'] if is_inpatient else None
visit_id = m['metadata']['visit_id']
visit_mapping[visit_id] = Visit(
visit_type=visit_type,
visit_start_datetime=time,
visit_end_datetime=visit_end_datetime,
discharge_facility=discharge_facility,
events=[]
)
for e in events:
events_with_visit = copy.deepcopy(e)
for m in e['measurements']:
# Remove the measurements without a visit_id, maybe these measurements should be connected to
# the same visit_id since they have the same timestamp?
if not m['metadata']['visit_id'] or m['metadata']['table'] == 'visit':
events_with_visit['measurements'].remove(m)
# add this event to the corresponding Visit
if events_with_visit['measurements']:
visit_id = events_with_visit['measurements'][0]['metadata']['visit_id']
visit_mapping[visit_id]['events'].append(events_with_visit)
return CehrBertPatient(
patient_id=patient_id,
static_measurements=static_measurements,
birth_datetime=birth_datetime,
visits=list(visit_mapping.values()),
race=race,
gender=gender,
ethnicity=ethnicity
)
That looks like a good first pass! Responding to various of your questions / comments inline:
One thing I wanted to bring up is that I had to modify the meds_etl locally to add discharge_facility as an additional metadata field (https://github.com/ChaoPang/meds_etl/commit/81604e3be13a53d15a97d016bbf925a57a12d1ab), it would be great if this field could be added to metadata in the MEDs, please let me know what you think.
Feel free to submit a pull request to add that metadata in. Adding additional metadata should generally not cause any problems.
It would be helpful to add is_visit_record for sure, however, I am not sure it would make a huge difference since this could be computed on the fly.
It would mainly be useful if you want to support MIMIC as well. So you could have the same code path for both MIMIC and OMOP.
Below is the logic I used to convert the MEDS Patient to CEHRTPatient,
I think this is a good first pass, but I have some comments / suggestion.
One problem is that code might run into issues with weird OMOP datasets. For instance, events[0] is not always the birth event, you sometimes have events before birth depending on the dataset.
if 'race' in code.lower(): race = code elif 'gender' in code.lower(): gender = code elif 'ethnicity' in code.lower(): ethnicity = code
I would generally recommend doing an explicit prefix only query. If code.startswith('Gender/') will only match gender codes. Etc, etc. Your current code might be too inclusive.
for e in events:
events_with_visit = copy.deepcopy(e)
for m in e['measurements']:
# Remove the measurements without a visit_id, maybe these measurements should be connected to
# the same visit_id since they have the same timestamp?
if not m['metadata']['visit_id'] or m['metadata']['table'] == 'visit':
events_with_visit['measurements'].remove(m)
# add this event to the corresponding Visit
if events_with_visit['measurements']:
visit_id = events_with_visit['measurements'][0]['metadata']['visit_id']
visit_mapping[visit_id]['events'].append(events_with_visit)
I don't see how this code could possibly work if you have an event with measurements from the different visits? Here is how this code should be written IMHO
for e in events:
for m in e['measurements']:
# Remove the measurements without a visit_id, maybe these measurements should be connected to
# the same visit_id since they have the same timestamp?
if m['metadata']['visit_id'] and m['metadata']['table'] != 'visit':
visit_mapping[visit_id]['events'].append(Event(time=e['time'], measurements=[m]))
I would also recommend sorting the events after you are done
for v in visit_mapping.values():
v['events'].sort(key=e['time'])
@ChaoPang I know I'm coming into this late, but it would be helpful to know a bit more about why you need to know if something is a visit / whether or not a measurement is within a visit, etc. Basically, I'm wondering if you could emulate something that we do in ESGPT that would fit into MEDS very naturally, which is we just explicitly add in events for "VISIT_START" and "VISIT_END".
In particular, if you have a record for a patient with an admission at timestamp t1
, some labs l1
, l2
, at t2
, and a medication m1
prescription @ t3
, then a discharge at t4
, you could encode that with:
[{"timestamp": t1, "measurements": [{"code": "ADMISSION"}]}, {"timestamp": t2, "measurements": [{"code": l1}, {"code": l2}]}, {"timetsamp": "t3", "measurements": [{"code": "m1"}]}, {"timestamp": "t4", "measurements": [{"code": "DISCHARGE"}]}
What this encoding misses is the possibility that you have overlapping visits, or that you need to know in a more direct fashion that the events between the "ADMISSION"
and "DISCHARGE"
belong to the same "visit" while still preserving their timestamps. If you need to know that they belong to the same visit but don't care about the intra-visit temporal ordering, I would just make the entire visit one event and put all codes inside of it, unordered. E.g., [{"timestamp": t1, "measurements": [{"code": l1}, {"code": l2}, {"code": m1}]}
.
Would any of those strategies work?
@mmcdermott thanks for sharing your thoughts on this, it's very much appreciated! This is very close to what I am doing in the CEHR-BERT/GPT conceptually, where the visit_start
and visit_end
are artificial measurements created in our data pipeline. The patient sequence required by CEHR-BERT/GPT looks like below, where C_i
represents some arbitrary concept (I omitted the other CEHR-BERT inputs just for illustration purposes)
[VS] [outpatient] [C1] [VE] [one-week] [VS] [inpatient] [C2] [one-day] [C3] [Discharge type] [VE]
I am taking the MEDS as the input and converting it to a CEHR-BERT extension so that it can be ingested by the model. The reason that I wanted to introduce the following entities is for the ease of data processing.
class Visit(TypedDict):
visit_type: str
visit_start_datetime: datetime.datetime
visit_end_datetime: NotRequired[datetime.datetime]
discharge_facility: NotRequired[str]
events: List[Event]
class PatientExtension(TypedDict):
patient_id: int
static_measurements: List[Measurement]
birth_datetime: datetime.datetime
gender: str
race: str
visits: List[Visit]
@EthanSteinberg However, I just realized that I don't actually need to use Event
, perhaps I could use the Visit
object below without events
because all the information I need is stored in the measurements. Would this violate the core design of MEDS though? What I am thinking is that this conversion is most likely to be one-way from MEDS to CEHRBERT, therefore Event
may not be needed unless this extension will be used by other folks, in which case, I should probably put a list of events
in Visit
instead of measurements
directly. What are you thoughts on this?
class Visit(TypedDict):
visit_type: str
visit_start_datetime: datetime.datetime
visit_end_datetime: NotRequired[datetime.datetime]
discharge_facility: NotRequired[str]
events: List[Measurement]
@ChaoPang Yeah, getting rid of Event is the right call IMHO.
We are extremely likely to remove Event from MEDS as well soon.
Background
I am trying to figure out how to retrieve
visit_start_datetime
,visit_end_datetime
,visit_concept_id
, anddischarge_facility_code
from the MED more efficiently. The reason is that I am currently converting CEHR-BERT to be MED compatible, which will require those fields to construct all the model inputs.First experiment
In my first experiment, I used
event
to store the visit information, where I putvisit_start_datetime
asevent.time
, storedvisit_concept_id
as the first measurement of this event, and thedischarge_facility_code
as the last measurement (only for inpatient visits). To figure outvisit_end_datetime
, I had to iterate through all measurements and take the max timestamp from this event. Although this approach worked, the downside is that the MED datasets structured this way would not work for FEMR models directly (correct me if I am wrong) because anevent
can span a few days due to the inpatient visits, which may violate the meaning of an Event.Second experiment: Med Extension
In the second experiment, we tried to extend the MED schema, where I introduced
Visit
andPatientExtension
objects to store the information required for CEHR-BERT.In addition, I created a data conversion script to convert the MED extension back to the original MED format so it would work with FEMR models too.
Do you have any plans to introduce
Visit
to the schema? If not, what would you recommend we do to retrieve visits efficiently in MED? Currently,visit_occurrence_id
andvisit_end_datetime
are stored in the metadata field ofMeasurement
. The information is available in the MED, the only way I could think of is to look up thevisit_occurrence_id
in themetadata
field of the first measurement of each event, and connect events if they have the samevisit_occurrence_id
. What do you think is the best way to identify the groups of events that belong to the same inpatient visits?