OHDSI / dbt-synthea

[Under development] A dbt ETL project to convert a Synthea synthetic data set into the OMOP CDM
https://ohdsi.github.io/dbt-synthea/
Apache License 2.0
16 stars 6 forks source link

[BUG] Unclear Error When Loading Synthea Vocabularies #90

Closed TheCedarPrince closed 3 weeks ago

TheCedarPrince commented 4 weeks ago

When running the following lines

file_dict=$(python3 scripts/python/get_csv_filepaths.py path/to/vocab/csvs)
dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: true}"

I get multiple errors with the run-operation and had to patch the macros/load_data_duckdb.sql as follows:

    {% for n, p in file_dict.items() %}
        {% do run_query("DROP TABLE IF EXISTS " ~ target_schema ~ "." ~ n.lower() ~ ";") %}
        {% do run_query("CREATE TABLE IF NOT EXISTS " ~ target_schema ~ "." ~ n.lower() ~ " AS SELECT * FROM read_csv('" ~ p ~ "', quote = \'\"\', delim = '\t', ignore_errors = true);") %}
    {% endfor %}

Otherwise, I see errors like this:

(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: true}"
01:41:34  Running with dbt=1.8.7
01:41:34  Registered adapter: duckdb=1.8.0
01:41:35  Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
01:41:39  Encountered an error while running operation: Runtime Error
  Invalid Input Error: CSV Error on Line: 245521
  Original Line: 45883506   'Baby' or low-dose aspirin  Meas Value  LOINC   Answer  S   LA14345-5   19700101    20991231    
  Value with unterminated quote found.

  Possible fixes:
  * Enable ignore errors (ignore_errors=true) to skip this row
  * Set quote do empty or to a different value (e.g., quote='')

    file=/home/thecedarprince/Programs/SYNTHEA/vocabulary/CONCEPT.csv
    delimiter =      (Auto-Detected)
    quote = ' (Auto-Detected)
    escape = \ (Auto-Detected)
    new_line = \n (Auto-Detected)
    header = true (Auto-Detected)
    skip_rows = 0 (Auto-Detected)
    comment = \0 (Auto-Detected)
    date_format =  (Auto-Detected)
    timestamp_format =  (Auto-Detected)
    null_padding=0
    sample_size=20480
    ignore_errors=false
    all_varchar=0

Note: even with the patching, I still face errors later when loading vocabularies into DuckDB so this patching doesn't really work.

lawrenceadams commented 4 weeks ago

I've never used this macro @TheCedarPrince - but I have used duckdb/postgres a lot to do this and spent AGES trying to figure out how to do this.

Turns out all you need to do is turn off quote parsing - i.e.

read_csv('" ~ p ~ "', quote = '')

and it should work neatly from memory

TheCedarPrince commented 4 weeks ago

Yup! That fixed it on my side! I can make a PR later today unless if you beat me first @lawrenceadams !

lawrenceadams commented 4 weeks ago

You fire on! You caught it! Glad to hear

TheCedarPrince commented 3 weeks ago

Closed thanks to #93