ImagingDataCommons / ETL


Inconsistencies identified for hnscc_3dct_rt_clinical table #35

Closed (fedorov closed 2 years ago)

fedorov commented 2 years ago

column_metadata contains a long list of variables for the hnscc_3dct_rt_clinical collection:

[screenshot]

However, the referenced table is very short:

[screenshot]

Developing this thought, an easy regression test would be: for each table_name in column_metadata, take the list of variable_name values and confirm that it exactly matches the list of columns in the schema of the corresponding table. We should run this check on every update of the clinical metadata tables. I will submit a separate follow-up ticket for that.
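A minimal sketch of such a check is shown here, only to make the idea concrete. It assumes the column_metadata rows and the table schemas have already been fetched into plain Python structures; the function name `find_schema_mismatches` and the result keys are hypothetical, and how the rows are actually queried is left out.

```python
from collections import defaultdict


def find_schema_mismatches(column_metadata_rows, table_schemas):
    """Compare variable names in column_metadata with actual table columns.

    column_metadata_rows: iterable of dicts with 'table_name' and
        'variable_name' keys (assumed shape of the metadata rows).
    table_schemas: dict mapping table name -> list of column names.
    Returns a dict keyed by table name, with one entry for every table
    whose column set differs from its metadata.
    """
    # Group the documented variable names by table.
    expected = defaultdict(set)
    for row in column_metadata_rows:
        expected[row["table_name"]].add(row["variable_name"])

    mismatches = {}
    for table_name, variables in expected.items():
        # A table missing from table_schemas is treated as having no columns.
        actual = set(table_schemas.get(table_name, []))
        if variables != actual:
            mismatches[table_name] = {
                "missing_in_table": sorted(variables - actual),
                "missing_in_metadata": sorted(actual - variables),
            }
    return mismatches
```

An empty result means every table listed in column_metadata has exactly the documented columns. Note that this sketch compares column names as sets, so it does not check column order or types.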

G-White-ISB commented 2 years ago

This collection has clinical data spread across different sheets in the Excel file, corresponding to different types of attributes (demographics, patient response, weight, etc.). This was not recorded correctly in the clinical_notes.json file. The code was trying to create the same table, hnscc_3dct_rt_clinical, for each of these sheets, creating a race condition. The sheet with the weight information 'won'. The column_metadata table, however, recorded the columns for all of the different versions of hnscc_3dct_rt_clinical being created.
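The failure mode can be illustrated with a small simulation. This is not the ETL code: `load_sheets`, the sheet names, and the column names are all hypothetical, and the loop here is sequential, so the last sheet always wins, whereas in the real pipeline the winner depended on timing.

```python
def load_sheets(sheets, table_name_for_sheet):
    """Simulate writing one table per sheet while recording column metadata.

    sheets: dict mapping sheet name -> list of column names (illustrative data).
    table_name_for_sheet: function mapping a sheet name to a table name.
    """
    tables = {}           # table name -> columns of the table as last written
    column_metadata = []  # (table_name, variable_name) rows, appended on every write
    for sheet_name, columns in sheets.items():
        table_name = table_name_for_sheet(sheet_name)
        tables[table_name] = list(columns)  # a later write replaces an earlier one
        column_metadata.extend((table_name, col) for col in columns)
    return tables, column_metadata


# Hypothetical sheets and columns, for illustration only.
sheets = {
    "demographics": ["patient_id", "age"],
    "response": ["patient_id", "response"],
    "weight": ["patient_id", "weight_kg"],
}

# Every sheet mapped to the same table name: the table keeps only the
# columns of the last sheet, while the metadata accumulates all of them.
tables, metadata = load_sheets(sheets, lambda sheet: "hnscc_3dct_rt_clinical")
print(tables)
print(len(metadata))
```

With every sheet mapped to one table name, the table ends up with two columns while the metadata holds six rows for it, which mirrors the mismatch reported above. Giving each sheet its own table name keeps the two in step.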

Updating the data in the clinical_notes.json file should fix this problem. As noted in #35, the new regression testing will hopefully pick up these errors in the future.

G-White-ISB commented 2 years ago

Visual inspection of the tables confirms consistency between the hnscc-3dct_rt table column names and the metadata in column_metadata.