Closed EthanSteinberg closed 5 months ago
@tompollard @mmcdermott
Ok. I think I fixed all the issues that were brought up (except for static measurements).
Static measurements needs to be fixed on the schema repository first before it can be added to this.
I also added tests to demonstrate that the code works.
Sorry for the delay @EthanSteinberg. I took a quick look and added some comments. I think the main issue for me was clarity around use of
meds_etl_flat
andmeds_reverse_etl_flat
.I would expect
meds_etl_flat
to go from meds to meds-flat, and I'd expectmeds_reverse_etl_flat
to reverse this process. Is this what you intend, or have I misunderstood?I was successfully able to convert MIMIC-MEDS to MIMIC-MEDS-Flat with the following command:
meds_reverse_etl_flat ./output/meds ./output/flat
The name of the script confuses me (I'm not reversing anything!) but from a quick review it appears to have worked.
Maybe choosing slightly different names for the scripts would help to clarify usage (e.g. meds_to_flat_etl
and flat_to_meds_etl
).
@tompollard Thanks for the review and good catch on the naming!
I think I fixed all the issues you brought up.
I introduced the names "meds_to_flat_etl" and "meds_from_flat_etl" to make things more clear.
I want meds as a prefix since these are global commands and I think it would be good for them to all have the same prefix.
It can be saved for a later PR, but I think the best approach would be to switch to a single entrypoint with multiple subcommands (e.g. something like meds flatten
Yeah, this would probably be a good idea for a future PR.
thanks!
It is currently difficult to write ETLs directly into MEDS due to a mix of nesting (MEDS is a nested schema with patients, events, then measurements), HuggingFace related issues (like not supporting large_list), and the metadata struct.
This PR introduces an intermediate ETL stage called MEDS CSV.
MEDS CSV is essentially the MEDS schema, but in an unnested form with metadata flattened. It is much simpler to write ETLs into MEDS CSV, and then go from MEDS CSV to MEDS rather than doing everything in one shot.
We should / will rewrite the current OMOP and MIMIC ETLs to use this MEDS CSV intermediate format.