ImagingDataCommons / ETL

(CORE REPO)
Apache License 2.0
0 stars 1 forks source link

[clinical] Fix out of order columns in the per-collection tables #22

Closed fedorov closed 2 years ago

fedorov commented 2 years ago

It is indeed unfortunate that the columns of the tables are all shuffled.

I noticed you are creating the schema manually, but then are importing from JSON.

Did you consider just importing from JSON, and let BQ auto-detect the schema? (e.g., I do this here for ACRIN tables: https://github.com/fedorov/idc-clinical-cleanup/blob/main/bq_exporter.py#L13-L36).

If that auto-detection of the column type does not produce results needed, we might do some casting later. But as is, it is very challenging to compare original data and ingested data side by side.

G-White-ISB commented 2 years ago

I had originally detected the column types using pandas but I'll try auto-detection in BigQuery.

G-White-ISB commented 2 years ago

So auto-detect in BigQuery will not work because the clinical data is sparse. Some patients rows have missing columns. But I figured out how to do this using the predetermined schema. This column order is now fixed in idc-dev-etl.clinical dataset

fedorov commented 2 years ago

So auto-detect in BigQuery will not work because the clinical data is sparse. Some patients rows have missing columns.

I don't understand that. I understand individual patients may have empty values in certain columns, but the columns are defined per-table, how can they be missing?

G-White-ISB commented 2 years ago

In the JSON array used to populate the tables, patient rows will have missing attributes . The first patient will have one set of defined attributes, and the 2nd patient will have a somewhat different set etc. This makes it tricky to auto-detect the column order. I expect that, using auto-detect, the columns are created in the order the corresponding attributes are first encountered.

fedorov commented 2 years ago

This has been resolved.