Medical-Event-Data-Standard / meds_etl

A collection of ETLs from common data formats to Medical Event Data Standard
Apache License 2.0
16 stars 3 forks source link

Create a MEDS FLAT schema + ETL to enable easier ETL writing #6

Closed EthanSteinberg closed 5 months ago

EthanSteinberg commented 6 months ago

It is currently difficult to write ETLs directly into MEDS due to a mix of nesting (MEDS is a nested schema with patients, events, then measurements), HuggingFace related issues (like not supporting large_list), and the metadata struct.

This PR introduces an intermediate ETL stage called MEDS CSV.

MEDS CSV is essentially the MEDS schema, but in an unnested form with metadata flattened. It is much simpler to write ETLs into MEDS CSV, and then go from MEDS CSV to MEDS rather than doing everything in one shot.

We should / will rewrite the current OMOP and MIMIC ETLs to use this MEDS CSV intermediate format.

EthanSteinbergPrealize commented 5 months ago

@tompollard @mmcdermott

Ok. I think I fixed all the issues that were brought up (except for static measurements).

Static measurements needs to be fixed on the schema repository first before it can be added to this.

I also added tests to demonstrate that the code works.

tompollard commented 5 months ago

Sorry for the delay @EthanSteinberg. I took a quick look and added some comments. I think the main issue for me was clarity around use of meds_etl_flat and meds_reverse_etl_flat.

I would expect meds_etl_flat to go from meds to meds-flat, and I'd expect meds_reverse_etl_flat to reverse this process. Is this what you intend, or have I misunderstood?

I was successfully able to convert MIMIC-MEDS to MIMIC-MEDS-Flat with the following command:

meds_reverse_etl_flat ./output/meds ./output/flat

The name of the script confuses me (I'm not reversing anything!) but from a quick review it appears to have worked.

Maybe choosing slightly different names for the scripts would help to clarify usage (e.g. meds_to_flat_etl and flat_to_meds_etl).

EthanSteinbergPrealize commented 5 months ago

@tompollard Thanks for the review and good catch on the naming!

I think I fixed all the issues you brought up.

I introduced the names "meds_to_flat_etl" and "meds_from_flat_etl" to make things more clear.

I want meds as a prefix since these are global commands and I think it would be good for them to all have the same prefix.

EthanSteinbergPrealize commented 5 months ago

It can be saved for a later PR, but I think the best approach would be to switch to a single entrypoint with multiple subcommands (e.g. something like meds flatten or meds-etl flatten .

Yeah, this would probably be a good idea for a future PR.

tompollard commented 5 months ago

thanks!