Medical-Event-Data-Standard / meds_etl

A collection of ETLs from common data formats to Medical Event Data Standard
Apache License 2.0
24 stars 5 forks source link

OMOP to MEDS ETL fails with InvalidOperationError #33

Closed chang-alan-l closed 3 months ago

chang-alan-l commented 3 months ago

Hi there,

I'm working through performing the MEDS ETL pipeline for OMOP data from Stanford, getting a polars InvalidOperationError when running meds_etl_omop --backend cpp ${OMOP_DIR} ${TARGET_DIR} .

polars.exceptions.InvalidOperationError: conversion from `str` to `i64` failed in column '' for 8263683 out of 8263683 values: ["LOINC/39342-1", "dm+d/16131811000001103", … "NDC/00603002632"]

Which occurs from the event_data.collect().write_parquet() call.

My source OMOP directory based off a BigQuery extract job that has downloaded the following OMOP tables and custom extract tables as partitioned .csv.gz files within the following directories:

attribute_definition
care_site
cdm_source
child_birth_apgar_data
Cohort_ids
concept
concept_ancestor
concept_class
concept_relationship
concept_synonym
condition_era
condition_occurrence
cost
death
device_exposure
domain
dose_era
drug_era
drug_exposure
drug_strength
fact_relationship
location
measurement
metadata
Mom_child_deid_link
Mom_child_phi_link
note
note_nlp
observation
observation_period
payer_plan_period
person
procedure_occurrence
provider
relationship
source_to_concept_map
specimen
_TABLES_
visit_detail
visit_occurrence
vocabulary

With the following package versions:

meds                      0.2.0
meds-etl                  0.2.3
meds-etl-cpp              0.2.2
polars                    1.4.1

The output of event_data.collect_schema() is:

Schema([('patient_id', Int64), ('time', Datetime(time_unit='us', time_zone=None)), ('code', Int64), ('datetime_value', Datetime(time_unit='us', time_zone=None)), ('numeric_value', Float32), ('text_value', String), ('table', String), ('visit_id', String), ('unit', String), ('clarity_table', String)])

Apologies if I'm missing something obvious here in the preprocessing or other requirements for the ETL pipeline as I'm very new to the MEDS ecosystem, happy to try and provide additional information or diagnostic outputs.

EthanSteinberg commented 3 months ago

Hi Alan,

That error message looks like something went wrong with the csv parsing?

  1. Do you know what table it failed on?
  2. Can you provide the header line (so I can see the columns) of the file that failed?

As an aside, for Stanford, I would highly recommend downloading STARR-OMOP in Parquet format from BigQuery. Parquet is much faster and much more reliable.

See https://cloud.google.com/bigquery/docs/exporting-data#bq, make sure to set the format to Parquet.

chang-alan-l commented 3 months ago

Looks like it is failing on the observation table for me.

Header line for the first corresponding .csv.gz file, once extracted:

_observation_name,observation_id,person_id,observation_concept_id,observation_DATE,observation_DATETIME,observation_type_concept_id,value_as_number,value_as_string,value_as_concept_id,qualifier_concept_id,unit_concept_id,provider_id,visit_occurrence_id,visit_detail_id,observation_source_value,observation_source_concept_id,unit_source_value,qualifier_source_value,trace_id,unit_id,load_table_id
EthanSteinberg commented 3 months ago

Hmm.

"_observation_name" is not a valid OMOP column name? Can you double check that this got exported properly?

chang-alan-l commented 3 months ago

Did a little bit of digging on this today. It turns out that the table exported correctly, but "_observation_name" appears to be a convenience column added during this custom extract derived from a larger OMOP dataset. This column probably helps use the observation table in standalone analysis.

Interesting that it has the downstream effect that the wrong column gets attempted to be cast to Int64. After digging around with the debugger (while trying to hack together a version of the script that works for my dataset, no success so far but in progress), I wondering if you could point me roughly to where in omop.py the data type conversion logic is happening or where having the column named "_observation_name" might be throwing things off. https://github.com/Medical-Event-Data-Standard/meds_etl/blob/0.2.3/src/meds_etl/omop.py

Based on my read of the traceback, the code values created in this block (lines 453-454) are being attempted to be cast to Int64 down the line.

concept_id = pl.col("concept_id").cast(pl.Int64)
code = pl.col("vocabulary_id") + "/" + pl.col("concept_code")

More detailed polars output from event_data.explain() if that helps pinpoint the issue. Curious about the last text chunk FROM\n RENAME\n DF ["_observation_name", "observation_id", "person_id", "observation_concept_id"]; PROJECT 12/22 COLUMNS; SELECTION: None if it suggests that "_observation_name" is included based on matching the table name.

' SELECT [col("person_id").strict_cast(Int64).alias("patient_id"), col("observation_datetime").str.strptime([String(raise)]).coalesce([col("observation_datetime").str.strptime([String(raise)]).dt.offset_by([String(1d)]).dt.offset_by([String(-1s)])]).coalesce([col("observation_date").str.strptime([String(raise)]).coalesce([col("observation_date").str.strptime([String(raise)]).dt.offset_by([String(1d)]).dt.offset_by([String(-1s)])])]).alias("time"), when([(col("__POLARS_CSER_0xf486b40c929ab47e")) != (0)]).then(col("__POLARS_CSER_0xf486b40c929ab47e")).otherwise(when([(col("__POLARS_CSER_0xbaa0ccb3692fc5eb")) != (0)]).then(col("__POLARS_CSER_0xbaa0ccb3692fc5eb")).otherwise(null.strict_cast(Int64))).replace([Series, Series]).alias("code"), null.strict_cast(Datetime(Microseconds, None)).alias("datetime_value"), when(null.strict_cast(Datetime(Microseconds, None)).is_null()).then(col("__POLARS_CSER_0x823938c5c144b358").coalesce([when(col("__POLARS_CSER_0xe8569769b426d4f3")).then(col("__POLARS_CSER_0x3d3fd014ea9f871c")).otherwise(col("__POLARS_CSER_0xeb0f6726d578868e").replace_strict([Series, Series, null]))]).str.strip_chars([null]).cast(Float32)).otherwise(null.strict_cast(Float32)).alias("numeric_value"), when([(null.strict_cast(Datetime(Microseconds, None)).is_null()) & (when(null.strict_cast(Datetime(Microseconds, None)).is_null()).then(col("__POLARS_CSER_0x823938c5c144b358").coalesce([when(col("__POLARS_CSER_0xe8569769b426d4f3")).then(col("__POLARS_CSER_0x3d3fd014ea9f871c")).otherwise(col("__POLARS_CSER_0xeb0f6726d578868e").replace_strict([Series, Series, null]))]).str.strip_chars([null]).cast(Float32)).otherwise(null.strict_cast(Float32)).is_null())]).then(col("__POLARS_CSER_0x823938c5c144b358").coalesce([when(col("__POLARS_CSER_0xe8569769b426d4f3")).then(col("__POLARS_CSER_0x3d3fd014ea9f871c")).otherwise(col("__POLARS_CSER_0xeb0f6726d578868e").replace_strict([Series, Series, null]))]).str.strip_chars([null])).otherwise(null.strict_cast(String)).alias("text_value"), String(observation).alias("table"), col("visit_occurrence_id").alias("visit_id"), col("unit_source_value").alias("unit"), col("load_table_id").alias("clarity_table")] FROM\n   WITH_COLUMN :\n   [col("value_as_number").coalesce([col("value_as_string")]).alias("__POLARS_CSER_0x823938c5c144b358"), [([(col("observation_source_concept_id").strict_cast(Int64)) == (0)]) & ([(col("value_as_concept_id").strict_cast(Int64)) != (0)])].alias("__POLARS_CSER_0xe8569769b426d4f3"), col("observation_concept_id").strict_cast(Int64).alias("__POLARS_CSER_0xbaa0ccb3692fc5eb"), [(String(SOURCE_CODE/)) + (col("observation_source_value"))].alias("__POLARS_CSER_0x3d3fd014ea9f871c"), col("observation_source_concept_id").strict_cast(Int64).alias("__POLARS_CSER_0xf486b40c929ab47e"), col("value_as_concept_id").strict_cast(Int64).alias("__POLARS_CSER_0xeb0f6726d578868e")] \n    FILTER when([(col("observation_source_concept_id").strict_cast(Int64)) != (0)]).then(col("observation_source_concept_id").strict_cast(Int64)).otherwise(when([(col("observation_concept_id").strict_cast(Int64)) != (0)]).then(col("observation_concept_id").strict_cast(Int64)).otherwise(null.strict_cast(Int64))).replace([Series, Series]).is_not_null() FROM\n      RENAME\n        DF ["_observation_name", "observation_id", "person_id", "observation_concept_id"]; PROJECT 12/22 COLUMNS; SELECTION: None
EthanSteinberg commented 3 months ago

@chang-alan-l

You have the exact right code block. My guess is that the header got mangled somehow so what polars thinks is the concept_id column is actually a different column.

It's also possible that this is a polars bug.

kgweber-sag commented 3 months ago

I think the root cause here might actually be (meds_etl==0.1.2) in line 220:

code = concept_id.replace(concept_id_map)

The replace function is, I think, trying to keep concept_id and, therefore, code as integers even though we're swapping in a string value.

I happened to hit the same error on procedure_occurrence and cleared it by changing the line to:

code=concept_id.replace_strict(concept_id_map)

EthanSteinberg commented 3 months ago

Oh, I think I know what is going on here.

It looks like this was a breaking change between Polars version 0.20 and version 1. https://github.com/Medical-Event-Data-Standard/meds_etl/blob/0.1.2/pyproject.toml was too flexible of a version constraint, and thus this broke with Polars 1.

This also explains why I never saw this bug earlier. Polars 1 only recently came out and when I wrote this code originally Polars 1 wasn't out yet.

I've created a new release tag 0.1.3, that has the fixed pyproject.toml requirements: https://github.com/Medical-Event-Data-Standard/meds_etl/commit/d93fe27bc389313a7a1d1599f2d86ae9deab4e1f

Thanks for the help and apologies for the bug!

EthanSteinberg commented 3 months ago

I'm going to close this issue as I think it's now resolved? Please reopen if it is not.