Hi Alan,
That error message looks like something went wrong with the csv parsing?
As an aside, for Stanford, I would highly recommend downloading STARR-OMOP in Parquet format from BigQuery. Parquet is much faster and much more reliable.
See https://cloud.google.com/bigquery/docs/exporting-data#bq and make sure to set the format to Parquet.
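The linked page documents the bq CLI flow; as a rough sketch, the equivalent export via the BigQuery Python client would look something like this (the project, dataset, table, and bucket names are placeholders):
from google.cloud import bigquery

# Placeholders: substitute your own project, STARR-OMOP dataset/table, and GCS bucket.
client = bigquery.Client(project="my-project")
job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.PARQUET

extract_job = client.extract_table(
    "my-project.starr_omop.observation",           # source table
    "gs://my-bucket/omop/observation/*.parquet",   # sharded Parquet output
    job_config=job_config,
)
extract_job.result()  # block until the export job finishes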
Looks like it is failing on the observation table for me.
Header line for the first corresponding .csv.gz file, once extracted:
_observation_name,observation_id,person_id,observation_concept_id,observation_DATE,observation_DATETIME,observation_type_concept_id,value_as_number,value_as_string,value_as_concept_id,qualifier_concept_id,unit_concept_id,provider_id,visit_occurrence_id,visit_detail_id,observation_source_value,observation_source_concept_id,unit_source_value,qualifier_source_value,trace_id,unit_id,load_table_id
Hmm.
"_observation_name" is not a valid OMOP column name? Can you double check that this got exported properly?
Did a little bit of digging on this today. It turns out that the table exported correctly, but "_observation_name" appears to be a convenience column added during this custom extract, which is derived from a larger OMOP dataset. The column probably exists to make the observation table easier to use in standalone analysis.
Interesting that it has the downstream effect that the wrong column ends up being cast to Int64. After digging around with the debugger (while trying to hack together a version of the script that works for my dataset; no success so far, but in progress), I was wondering if you could point me roughly to where in omop.py the data type conversion logic happens, or where having the column named "_observation_name" might be throwing things off.
https://github.com/Medical-Event-Data-Standard/meds_etl/blob/0.2.3/src/meds_etl/omop.py
Based on my read of the traceback, the code values created in this block (lines 453-454) end up being cast to Int64 down the line:
concept_id = pl.col("concept_id").cast(pl.Int64)
code = pl.col("vocabulary_id") + "/" + pl.col("concept_code")
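To make that concrete, here is a minimal sketch of what those two expressions produce on a toy concept table (the values below are made up):
import polars as pl

# Toy stand-in for the OMOP concept table; values are purely illustrative.
concept = pl.DataFrame({
    "concept_id": ["8532", "8507"],
    "vocabulary_id": ["Gender", "Gender"],
    "concept_code": ["F", "M"],
})

# Mirrors the two expressions above: cast concept_id to Int64 and build the
# code string as "<vocabulary_id>/<concept_code>".
concept_map = concept.select(
    pl.col("concept_id").cast(pl.Int64),
    (pl.col("vocabulary_id") + "/" + pl.col("concept_code")).alias("code"),
)
print(concept_map)  # concept_id (i64), code (str)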
Here is more detailed polars output from event_data.explain(), in case that helps pinpoint the issue. I'm curious about the last text chunk, FROM\n RENAME\n DF ["_observation_name", "observation_id", "person_id", "observation_concept_id"]; PROJECT 12/22 COLUMNS; SELECTION: None, and whether it suggests that "_observation_name" is being included based on matching the table name.
' SELECT [col("person_id").strict_cast(Int64).alias("patient_id"), col("observation_datetime").str.strptime([String(raise)]).coalesce([col("observation_datetime").str.strptime([String(raise)]).dt.offset_by([String(1d)]).dt.offset_by([String(-1s)])]).coalesce([col("observation_date").str.strptime([String(raise)]).coalesce([col("observation_date").str.strptime([String(raise)]).dt.offset_by([String(1d)]).dt.offset_by([String(-1s)])])]).alias("time"), when([(col("__POLARS_CSER_0xf486b40c929ab47e")) != (0)]).then(col("__POLARS_CSER_0xf486b40c929ab47e")).otherwise(when([(col("__POLARS_CSER_0xbaa0ccb3692fc5eb")) != (0)]).then(col("__POLARS_CSER_0xbaa0ccb3692fc5eb")).otherwise(null.strict_cast(Int64))).replace([Series, Series]).alias("code"), null.strict_cast(Datetime(Microseconds, None)).alias("datetime_value"), when(null.strict_cast(Datetime(Microseconds, None)).is_null()).then(col("__POLARS_CSER_0x823938c5c144b358").coalesce([when(col("__POLARS_CSER_0xe8569769b426d4f3")).then(col("__POLARS_CSER_0x3d3fd014ea9f871c")).otherwise(col("__POLARS_CSER_0xeb0f6726d578868e").replace_strict([Series, Series, null]))]).str.strip_chars([null]).cast(Float32)).otherwise(null.strict_cast(Float32)).alias("numeric_value"), when([(null.strict_cast(Datetime(Microseconds, None)).is_null()) & (when(null.strict_cast(Datetime(Microseconds, None)).is_null()).then(col("__POLARS_CSER_0x823938c5c144b358").coalesce([when(col("__POLARS_CSER_0xe8569769b426d4f3")).then(col("__POLARS_CSER_0x3d3fd014ea9f871c")).otherwise(col("__POLARS_CSER_0xeb0f6726d578868e").replace_strict([Series, Series, null]))]).str.strip_chars([null]).cast(Float32)).otherwise(null.strict_cast(Float32)).is_null())]).then(col("__POLARS_CSER_0x823938c5c144b358").coalesce([when(col("__POLARS_CSER_0xe8569769b426d4f3")).then(col("__POLARS_CSER_0x3d3fd014ea9f871c")).otherwise(col("__POLARS_CSER_0xeb0f6726d578868e").replace_strict([Series, Series, null]))]).str.strip_chars([null])).otherwise(null.strict_cast(String)).alias("text_value"), String(observation).alias("table"), col("visit_occurrence_id").alias("visit_id"), col("unit_source_value").alias("unit"), col("load_table_id").alias("clarity_table")] FROM\n WITH_COLUMN :\n [col("value_as_number").coalesce([col("value_as_string")]).alias("__POLARS_CSER_0x823938c5c144b358"), [([(col("observation_source_concept_id").strict_cast(Int64)) == (0)]) & ([(col("value_as_concept_id").strict_cast(Int64)) != (0)])].alias("__POLARS_CSER_0xe8569769b426d4f3"), col("observation_concept_id").strict_cast(Int64).alias("__POLARS_CSER_0xbaa0ccb3692fc5eb"), [(String(SOURCE_CODE/)) + (col("observation_source_value"))].alias("__POLARS_CSER_0x3d3fd014ea9f871c"), col("observation_source_concept_id").strict_cast(Int64).alias("__POLARS_CSER_0xf486b40c929ab47e"), col("value_as_concept_id").strict_cast(Int64).alias("__POLARS_CSER_0xeb0f6726d578868e")] \n FILTER when([(col("observation_source_concept_id").strict_cast(Int64)) != (0)]).then(col("observation_source_concept_id").strict_cast(Int64)).otherwise(when([(col("observation_concept_id").strict_cast(Int64)) != (0)]).then(col("observation_concept_id").strict_cast(Int64)).otherwise(null.strict_cast(Int64))).replace([Series, Series]).is_not_null() FROM\n RENAME\n DF ["_observation_name", "observation_id", "person_id", "observation_concept_id"]; PROJECT 12/22 COLUMNS; SELECTION: None
@chang-alan-l
You have the exact right code block. My guess is that the header got mangled somehow, so what polars thinks is the concept_id column is actually a different column.
It's also possible that this is a polars bug.
I think the root cause here might actually be line 220 (meds_etl==0.1.2):
code = concept_id.replace(concept_id_map)
The replace function is, I think, trying to keep concept_id, and therefore code, as integers even though we're swapping in a string value.
I happened to hit the same error on procedure_occurrence and cleared it by changing the line to:
code = concept_id.replace_strict(concept_id_map)
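For illustration, a minimal reproduction of the difference on a toy mapping (the concept IDs and codes below are made up):
import polars as pl

# Illustrative mapping from integer concept IDs to string codes, standing in
# for concept_id_map in omop.py.
concept_id_map = {8532: "Gender/F", 8507: "Gender/M"}

df = pl.DataFrame({"concept_id": [8532, 8507]})

# With Polars >= 1.0, replace() keeps the column's original dtype (Int64 here),
# so swapping in string codes fails with an InvalidOperationError like the one above.
# df.select(pl.col("concept_id").replace(concept_id_map))  # raises

# replace_strict() returns the dtype of the replacement values (String),
# which is what the code column needs to be.
print(df.select(pl.col("concept_id").replace_strict(concept_id_map).alias("code")))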
Oh, I think I know what is going on here.
It looks like this was a breaking change between Polars version 0.20 and version 1. https://github.com/Medical-Event-Data-Standard/meds_etl/blob/0.1.2/pyproject.toml had too flexible a version constraint, and so this broke with Polars 1.
This also explains why I never saw this bug earlier: Polars 1 only came out recently, and it wasn't released yet when I originally wrote this code.
I've created a new release tag, 0.1.3, with the fixed pyproject.toml requirements: https://github.com/Medical-Event-Data-Standard/meds_etl/commit/d93fe27bc389313a7a1d1599f2d86ae9deab4e1f
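If you need to stay on 0.1.2 for some reason, a quick runtime guard along these lines (a sketch using the third-party packaging library) can catch the mismatch up front:
import polars as pl
from packaging.version import Version

# meds_etl 0.1.2 assumes the pre-1.0 behaviour of Expr.replace, so bail out
# early if a Polars 1.x install is detected.
if Version(pl.__version__) >= Version("1.0.0"):
    raise RuntimeError(
        f"polars {pl.__version__} installed, but meds_etl 0.1.2 expects polars < 1.0; "
        "upgrade meds_etl to >= 0.1.3 or pin polars below 1.0."
    )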
Thanks for the help and apologies for the bug!
I'm going to close this issue as I think it's now resolved? Please reopen if it is not.
Hi there,
I'm working through running the MEDS ETL pipeline on OMOP data from Stanford and am hitting a polars InvalidOperationError when running meds_etl_omop --backend cpp ${OMOP_DIR} ${TARGET_DIR}. The error occurs from the event_data.collect().write_parquet() call. My source OMOP directory is based on a BigQuery extract job that downloaded the following OMOP tables and custom extract tables as partitioned .csv.gz files within the following directories:
With the following package versions:
The output of event_data.collect_schema() is:
Apologies if I'm missing something obvious here in the preprocessing or other requirements for the ETL pipeline, as I'm very new to the MEDS ecosystem. Happy to try and provide additional information or diagnostic outputs.
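In case it helps, this is the kind of diagnostic I can pull; the toy LazyFrame below stands in for event_data, which is built inside omop.py:
import polars as pl

# Any LazyFrame supports the same two diagnostics referenced above.
lf = pl.DataFrame({"person_id": [1], "observation_concept_id": [8532]}).lazy()
print(lf.collect_schema())  # column names and dtypes
print(lf.explain())         # optimized query plan, like the output pasted earlier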