Medical-Event-Data-Standard / meds

Schema definitions and Python types for Medical Event Data Standard, a standard for medical event data such as EHR and claims data
Apache License 2.0
38 stars 3 forks source link

How do I retrieve the visit information efficiently in the MEDS? #23

Open ChaoPang opened 7 months ago

ChaoPang commented 7 months ago

Background

I am trying to figure out how to retrieve visit_start_datetime, visit_end_datetime, visit_concept_id, and discharge_facility_code from the MED more efficiently. The reason is that I am currently converting CEHR-BERT to be MED compatible, which will require those fields to construct all the model inputs.

First experiment

In my first experiment, I used event to store the visit information, where I put visit_start_datetime as event.time, stored visit_concept_id as the first measurement of this event, and the discharge_facility_code as the last measurement (only for inpatient visits). To figure out visit_end_datetime, I had to iterate through all measurements and take the max timestamp from this event. Although this approach worked, the downside is that the MED datasets structured this way would not work for FEMR models directly (correct me if I am wrong) because an event can span a few days due to the inpatient visits, which may violate the meaning of an Event.

Second experiment: Med Extension

In the second experiment, we tried to extend the MED schema, where I introduced Visit and PatientExtension objects to store the information required for CEHR-BERT.

class Visit(TypedDict):
    visit_type: str
    visit_start_datetime: datetime.datetime
    visit_end_datetime: NotRequired[datetime.datetime]
    discharge_facility: NotRequired[str]
    events: List[Event]

class PatientExtension(TypedDict):
    patient_id: int
    static_measurements: List[Measurement]
    birth_datetime: datetime.datetime
    gender: str
    race: str
    visits: List[Visit]

In addition, I created a data conversion script to convert the MED extension back to the original MED format so it would work with FEMR models too.

from meds import Patient, Event, Measurement, birth_code
from med_extension.schema_extension import PatientExtension

def convert_med_extension_to_med(
        patient_extension: PatientExtension
) -> Patient:
    birth_date = patient_extension['birth_datetime']
    gender = patient_extension['gender']
    race = patient_extension['race']
    demographic_event = Event(
        time=birth_date,
        measurements=[
            Measurement(
                code=birth_code
            ),
            Measurement(
                code=gender
            ),
            Measurement(
                code=race
            )
        ]
    )
    events = [demographic_event]
    for visit in sorted(patient_extension['visits'], key=lambda v: v['visit_start_datetime']):
        events.extend(sorted(visit['events'], key=lambda e: e['time']))
    return Patient(
        patient_id=patient_extension['patient_id'],
        static_measurements=patient_extension['static_measurements'],
        events=events
    )

Do you have any plans to introduce Visit to the schema? If not, what would you recommend we do to retrieve visits efficiently in MED? Currently, visit_occurrence_id and visit_end_datetime are stored in the metadata field of Measurement. The information is available in the MED, the only way I could think of is to look up the visit_occurrence_id in the metadata field of the first measurement of each event, and connect events if they have the same visit_occurrence_id. What do you think is the best way to identify the groups of events that belong to the same inpatient visits?

EthanSteinberg commented 7 months ago

Thanks for bringing this up. Better handling of visits is important and this could / should be improved.

I think you have the right idea in terms of "Meds Extension". Basically, you want to define a second format "MEDS Visit" that is more organized around visit calculation, with transformations to that format from MEDS. I would recommend doing this transformation on the fly for models. So users pass in a MEDS dataset and your code would convert it to MEDS Visit as needed.

Note that we already have something similar called MEDS Flat, that makes writing ETLs easier (see https://github.com/Medical-Event-Data-Standard/meds_etl?tab=readme-ov-file#meds-flat).

For a first pass, I would call this format CEHRTPatient or something, with a name specific to your package. Once you get it working we could add it to this schema repository with the name MEDS Visit if you think it will be generally useful to other people.

For writing the transformation from MEDS to MEDS Visit, I would recommend using the visit_id tags that are retained in https://github.com/Medical-Event-Data-Standard/meds_etl/blob/main/src/meds_etl/omop.py#L277 and https://github.com/Medical-Event-Data-Standard/meds_etl/blob/main/src/meds_etl/mimic/__init__.py#L74

I'm not sure if those ETLs have everything you need to construct visit groupings. If you run into issues, feel free to make a Pull Request for the ETLs to add the information you need.


Now to answer your questions more explicitly

In my first experiment, I used event to store the visit information, where I put visit_start_datetime as event.time ... which may violate the meaning of an Event.

Yeah, this is invalid as measurements within an events (per the schema) must always have the same time.

Do you have any plans to introduce Visit to the schema?

I don't think so. If anything, we will probably make the core MEDS schema simpler (removing datetime_value, the double nesting of Event, etc).

The idea here is that we want a minimal schema with users providing additional structure (like MEDS Visit or MEDS Flat) as needed.

EthanSteinberg commented 7 months ago

@ChaoPang One idea that is also worth thinking about:

It might make your life easier if we add a metadata field "is_visit_record" to mark which records are supposed to indicate the start/end of each visit.

That way you would know that "MIMIC_IV_Admission/Inpatient" is a visit record and that "Visit/IP" (from OMOP) is a visit record.

That type of metadata field would make it easier for your code to support both OMOP and MIMIC (and potentially other datasets).

ChaoPang commented 7 months ago

@EthanSteinberg thanks a lot for the recommendation, that makes a lot of sense! I started looking at the meds_etl and was able to run it on an OMOP Synthea dataset (I was curious to see what the output looks like).

It might make your life easier if we add a metadata field "is_visit_record" to mark which records are supposed to indicate the start/end of each visit.

To identify visits, I iterated through all measurements with table=visit to retrieve all the visit records. It would be helpful to add is_visit_record for sure, however, I am not sure it would make a huge difference since this could be computed on the fly. One thing I wanted to bring up is that I had to modify the meds_etl locally to add discharge_facility as an additional metadata field (https://github.com/ChaoPang/meds_etl/commit/81604e3be13a53d15a97d016bbf925a57a12d1ab), it would be great if this field could be added to metadata in the MEDs, please let me know what you think.

Below is the logic I used to convert the MEDS Patient to CEHRTPatient,

INPATIENT_VISIT_TYPES = [
    '9201', '262', '8971', '8920', '38004311'
]
INPATIENT_VISIT_TYPE_CODES = [
    'Visit/IP', 'Visit/ERIP', 'Visit/51', 'Visit/61', 'NUCC/315D00000X'
]
def meds_to_cehrbert_extension(meds_record: Patient) -> CehrBertPatient:
    """
    Convert the MEDS Patient to the CerBertPatient extension, where the patient timeline is organized around visits
    """
    patient_id = meds_record['patient_id']
    static_measurements = meds_record['static_measurements']
    events = meds_record['events']

    assert len(events) >= 1

    birth_datetime = events[0]['time']
    race = None
    gender = None
    ethnicity = None

    for measurement in events[0]['measurements']:
        code = measurement['code']
        if 'race' in code.lower():
            race = code
        elif 'gender' in code.lower():
            gender = code
        elif 'ethnicity' in code.lower():
            ethnicity = code

    visit_mapping = dict()
    # Iterate through all measurements
    for m, time in [(m, e['time']) for e in events for m in e['measurements']]:
        if m['metadata']['table'] == 'visit':
            visit_type = m['code']
            is_inpatient = m['code'] in INPATIENT_VISIT_TYPE_CODES or m['code'] in INPATIENT_VISIT_TYPES
            visit_end_datetime = (
                datetime.datetime.strptime(m['metadata']['end'], DATE_FORMAT) if m['metadata']['end'] else None
            )
            discharge_facility = m['metadata']['discharge_facility'] if is_inpatient else None
            visit_id = m['metadata']['visit_id']
            visit_mapping[visit_id] = Visit(
                visit_type=visit_type,
                visit_start_datetime=time,
                visit_end_datetime=visit_end_datetime,
                discharge_facility=discharge_facility,
                events=[]
            )

    for e in events:
        events_with_visit = copy.deepcopy(e)
        for m in e['measurements']:
            # Remove the measurements without a visit_id, maybe these measurements should be connected to
            # the same visit_id since they have the same timestamp?
            if not m['metadata']['visit_id'] or m['metadata']['table'] == 'visit':
                events_with_visit['measurements'].remove(m)

        # add this event to the corresponding Visit
        if events_with_visit['measurements']:
            visit_id = events_with_visit['measurements'][0]['metadata']['visit_id']
            visit_mapping[visit_id]['events'].append(events_with_visit)

    return CehrBertPatient(
        patient_id=patient_id,
        static_measurements=static_measurements,
        birth_datetime=birth_datetime,
        visits=list(visit_mapping.values()),
        race=race,
        gender=gender,
        ethnicity=ethnicity
    )
EthanSteinberg commented 7 months ago

That looks like a good first pass! Responding to various of your questions / comments inline:

One thing I wanted to bring up is that I had to modify the meds_etl locally to add discharge_facility as an additional metadata field (https://github.com/ChaoPang/meds_etl/commit/81604e3be13a53d15a97d016bbf925a57a12d1ab), it would be great if this field could be added to metadata in the MEDs, please let me know what you think.

Feel free to submit a pull request to add that metadata in. Adding additional metadata should generally not cause any problems.

It would be helpful to add is_visit_record for sure, however, I am not sure it would make a huge difference since this could be computed on the fly.

It would mainly be useful if you want to support MIMIC as well. So you could have the same code path for both MIMIC and OMOP.

Below is the logic I used to convert the MEDS Patient to CEHRTPatient,

I think this is a good first pass, but I have some comments / suggestion.

One problem is that code might run into issues with weird OMOP datasets. For instance, events[0] is not always the birth event, you sometimes have events before birth depending on the dataset.

   if 'race' in code.lower():
        race = code
   elif 'gender' in code.lower():
       gender = code
   elif 'ethnicity' in code.lower():
       ethnicity = code

I would generally recommend doing an explicit prefix only query. If code.startswith('Gender/') will only match gender codes. Etc, etc. Your current code might be too inclusive.

    for e in events:
        events_with_visit = copy.deepcopy(e)
        for m in e['measurements']:
            # Remove the measurements without a visit_id, maybe these measurements should be connected to
            # the same visit_id since they have the same timestamp?
            if not m['metadata']['visit_id'] or m['metadata']['table'] == 'visit':
                events_with_visit['measurements'].remove(m)

        # add this event to the corresponding Visit
        if events_with_visit['measurements']:
            visit_id = events_with_visit['measurements'][0]['metadata']['visit_id']
            visit_mapping[visit_id]['events'].append(events_with_visit)

I don't see how this code could possibly work if you have an event with measurements from the different visits? Here is how this code should be written IMHO

   for e in events:
        for m in e['measurements']:
            # Remove the measurements without a visit_id, maybe these measurements should be connected to
            # the same visit_id since they have the same timestamp?
            if m['metadata']['visit_id'] and m['metadata']['table'] != 'visit':
                visit_mapping[visit_id]['events'].append(Event(time=e['time'], measurements=[m]))

I would also recommend sorting the events after you are done

for v in visit_mapping.values():
     v['events'].sort(key=e['time'])
mmcdermott commented 7 months ago

@ChaoPang I know I'm coming into this late, but it would be helpful to know a bit more about why you need to know if something is a visit / whether or not a measurement is within a visit, etc. Basically, I'm wondering if you could emulate something that we do in ESGPT that would fit into MEDS very naturally, which is we just explicitly add in events for "VISIT_START" and "VISIT_END".

In particular, if you have a record for a patient with an admission at timestamp t1, some labs l1, l2, at t2, and a medication m1 prescription @ t3, then a discharge at t4, you could encode that with:

[{"timestamp": t1, "measurements": [{"code": "ADMISSION"}]}, {"timestamp": t2, "measurements": [{"code": l1}, {"code": l2}]}, {"timetsamp": "t3", "measurements": [{"code": "m1"}]}, {"timestamp": "t4", "measurements": [{"code": "DISCHARGE"}]}

What this encoding misses is the possibility that you have overlapping visits, or that you need to know in a more direct fashion that the events between the "ADMISSION" and "DISCHARGE" belong to the same "visit" while still preserving their timestamps. If you need to know that they belong to the same visit but don't care about the intra-visit temporal ordering, I would just make the entire visit one event and put all codes inside of it, unordered. E.g., [{"timestamp": t1, "measurements": [{"code": l1}, {"code": l2}, {"code": m1}]}.

Would any of those strategies work?

ChaoPang commented 6 months ago

@mmcdermott thanks for sharing your thoughts on this, it's very much appreciated! This is very close to what I am doing in the CEHR-BERT/GPT conceptually, where the visit_start and visit_end are artificial measurements created in our data pipeline. The patient sequence required by CEHR-BERT/GPT looks like below, where C_i represents some arbitrary concept (I omitted the other CEHR-BERT inputs just for illustration purposes)

[VS] [outpatient] [C1] [VE] [one-week] [VS] [inpatient] [C2] [one-day] [C3] [Discharge type] [VE]

I am taking the MEDS as the input and converting it to a CEHR-BERT extension so that it can be ingested by the model. The reason that I wanted to introduce the following entities is for the ease of data processing.

class Visit(TypedDict):
    visit_type: str
    visit_start_datetime: datetime.datetime
    visit_end_datetime: NotRequired[datetime.datetime]
    discharge_facility: NotRequired[str]
    events: List[Event]

class PatientExtension(TypedDict):
    patient_id: int
    static_measurements: List[Measurement]
    birth_datetime: datetime.datetime
    gender: str
    race: str
    visits: List[Visit]

@EthanSteinberg However, I just realized that I don't actually need to use Event, perhaps I could use the Visit object below without events because all the information I need is stored in the measurements. Would this violate the core design of MEDS though? What I am thinking is that this conversion is most likely to be one-way from MEDS to CEHRBERT, therefore Event may not be needed unless this extension will be used by other folks, in which case, I should probably put a list of events in Visit instead of measurements directly. What are you thoughts on this?

class Visit(TypedDict):
    visit_type: str
    visit_start_datetime: datetime.datetime
    visit_end_datetime: NotRequired[datetime.datetime]
    discharge_facility: NotRequired[str]
    events: List[Measurement]
EthanSteinberg commented 6 months ago

@ChaoPang Yeah, getting rid of Event is the right call IMHO.

We are extremely likely to remove Event from MEDS as well soon.