Medical-Event-Data-Standard / meds_etl

A collection of ETLs from common data formats to Medical Event Data Standard
Apache License 2.0
16 stars 3 forks source link

broken meds_etl_mimic in the main branch #21

Closed mirmuss closed 1 month ago

mirmuss commented 2 months ago

Hi, I wanted to try speed gains from the cpp backend when I realized the following issue:

Processing tables into /nfs/home/mirmus/mimiciv_meds/temp
Processing icu/icustays
Processing icu/caregiver
Processing icu/d_items
Processing icu/datetimeevents
Processing icu/inputevents
Traceback (most recent call last):
  File "/nfs/home/mirmus/envs/femr/bin/meds_etl_mimic", line 8, in <module>
    sys.exit(main())
  File "/nfs/home/mirmus/envs/femr/lib/python3.10/site-packages/meds_etl/mimic/__init__.py", line 404, in main
    event_data = table_csv.select(**columns).collect()
  File "/nfs/home/mirmus/envs/femr/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1683, in collect
    return wrap_df(ldf.collect())
polars.exceptions.SchemaError: invalid series dtype: expected `String`, got `datetime[μs]`

I face this very same error regardless of the backend, when I install from main branch instead of pip Am I missing something?

EthanSteinberg commented 2 months ago

Thanks for the report!

I can reproduce this bug, and it looks like something I accidentally broke that ETL. Sorry about that! (We test on mimic-iv-demo, but this bug only triggers on the full mimic-iv unfortunately).

I have just fixed this and did a full mimic ETL to verify that it works.

If you git pull you should be able to rerun.

I'll leave this issue open in case you run into further bugs.

mirmuss commented 2 months ago

Thanks for the fast response!

I face another error when using the duckdb backend:

Processing each shard
Converting from MEDS Flat to MEDS using DuckDB with sharding...
Traceback (most recent call last):
  File "/nfs/home/mirmus/envs/femr/bin/meds_etl_mimic", line 8, in <module>
    sys.exit(main())
  File "/nfs/home/mirmus/projects/femr/meds_etl/src/meds_etl/mimic/__init__.py", line 423, in main
    meds_etl.flat.convert_flat_to_meds(
  File "/nfs/home/mirmus/projects/femr/meds_etl/src/meds_etl/flat.py", line 803, in convert_flat_to_meds
    convert_flat_to_meds_duckdb(
  File "/nfs/home/mirmus/projects/femr/meds_etl/src/meds_etl/flat.py", line 695, in convert_flat_to_meds_duckdb
    conn.sql(f"CREATE VIEW v{i} as SELECT {full_query} FROM '{task}'")
duckdb.duckdb.ParserException: Parser Error: syntax error at or near "table"

However the cpp backend raised another error:

Processing each shard
/nfs/home/mirmus/projects/femr/meds_etl/src/meds_etl/mimic/__init__.py:472: UserWarning: Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance.
  table = pl.read_csv(f, infer_schema_length=0)

which it seems to be fixed by adding .read() to the meds_etl/src/meds_etl/mimic/__init__.py:472 : table = pl.read_csv(f.read(), infer_schema_length=0)

this way I get 100 parquet files in the data directory.

I will be running femr repo using these outputs to make sure if everything is fine

EthanSteinberg commented 2 months ago

I face another error when using the duckdb backend:

@scottfleming It looks like your duckdb backend is malfunctioning? I'm going to remove the duckdb backend from the README then.

However the cpp backend raised another error:

That's not an error, that's a warning that our code isn't 100% efficient. It would be good to make this more efficient though, which I guess I can do.

EthanSteinberg commented 1 month ago

@mirmuss Did everything work for you? I think I'm going to close this issue in the meantime, but feel free to reopen if you run into further issues.