Closed fedorov closed 2 years ago
I had originally detected the column types using pandas but I'll try auto-detection in BigQuery.
So auto-detect in BigQuery will not work because the clinical data is sparse. Some patients rows have missing columns. But I figured out how to do this using the predetermined schema. This column order is now fixed in idc-dev-etl.clinical dataset
So auto-detect in BigQuery will not work because the clinical data is sparse. Some patients rows have missing columns.
I don't understand that. I understand individual patients may have empty values in certain columns, but the columns are defined per-table, how can they be missing?
In the JSON array used to populate the tables, patient rows will have missing attributes . The first patient will have one set of defined attributes, and the 2nd patient will have a somewhat different set etc. This makes it tricky to auto-detect the column order. I expect that, using auto-detect, the columns are created in the order the corresponding attributes are first encountered.
This has been resolved.
It is indeed unfortunate that the columns of the tables are all shuffled.
I noticed you are creating the schema manually, but then are importing from JSON.
Did you consider just importing from JSON, and let BQ auto-detect the schema? (e.g., I do this here for ACRIN tables: https://github.com/fedorov/idc-clinical-cleanup/blob/main/bq_exporter.py#L13-L36).
If that auto-detection of the column type does not produce results needed, we might do some casting later. But as is, it is very challenging to compare original data and ingested data side by side.