ImagingDataCommons / ETL

(CORE REPO)
Apache License 2.0

[clinical] collection_id should not be an array #32

Closed fedorov closed 2 years ago

fedorov commented 2 years ago

The field collection_id in the column-level metadata is an array for CPTAC collections. I think I can guess why this is the case (there is a single CPTAC table coming from ISB-CGC), but I am not sure this is the right way to encode this information. This will be quite confusing for the user. The schema of this column should be consistent with its schema everywhere else the column appears.

To deal with this, we might include a column Program along with collection_id.

Alternatively, we could unnest the multivalued rows of the collection_id column and replicate the column-level metadata for each collection in the program.
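
To make the unnesting option concrete, here is a minimal sketch in plain Python. It assumes the column-level metadata can be read as a list of dicts; the field names `column` and `table_name` and the `other_collection` id are hypothetical and used only for illustration. This is not the ETL's actual code.

```python
def unnest_collection_id(rows):
    """Return one metadata row per collection, so collection_id is always a scalar."""
    flat = []
    for row in rows:
        ids = row["collection_id"]
        # Rows from single-collection tables already hold a scalar id.
        if not isinstance(ids, list):
            ids = [ids]
        for collection_id in ids:
            new_row = dict(row)  # replicate the rest of the column-level metadata
            new_row["collection_id"] = collection_id
            flat.append(new_row)
    return flat


# Hypothetical sample rows; only the cptac_* ids come from this thread.
rows = [
    {"collection_id": ["cptac_coad", "cptac_gbm"], "column": "age", "table_name": "cptac_clinical"},
    {"collection_id": "other_collection", "column": "age", "table_name": "other_clinical"},
]
flat = unnest_collection_id(rows)
```

With this approach the single CPTAC table stays intact, and only the metadata rows are replicated per collection.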

G-White-ISB commented 2 years ago

The only way to be completely consistent with how all the other clinical data is presented would be to split the cptac_clinical table into separate tables based on the collection_id, i.e. one for cptac_coad, one for cptac_gbm, etc. This would not be a problem technically. But keeping them all together could be an advantage for cross-collection analysis.

We should probably discuss this.

G-White-ISB commented 2 years ago

But here's another complication: as far as I can tell, none of GDC, PDC, or ISB-CGC uses the CPTAC collection ids as reported in TCIA. In these other resources all the CPTAC patients belong to either CPTAC2 or CPTAC3, and the patients are not subdivided any further. The TCIA collection ids cptac-aml, cptac-brca, ... cptac-ucec don't appear in these resources. I suppose one could infer the collection id based on the disease type or primary site. These collection ids are not in the source BigQuery table I am pulling from ISB-CGC.

G-White-ISB commented 2 years ago

The cptac table is now split into several per-collection tables, each with its own entries in table_metadata and column_metadata. collection_id is no longer an array.
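
For readers following along, the split can be sketched roughly as below. This is only an illustration, not the code that was merged: it assumes each clinical row has already been assigned a scalar collection_id (the thread notes the source table does not carry one), and the `case_id` field and the `<collection_id>_clinical` naming pattern are hypothetical.

```python
from collections import defaultdict


def split_by_collection(rows, key="collection_id"):
    """Group clinical rows into one table (list of rows) per collection."""
    tables = defaultdict(list)
    for row in rows:
        # Hypothetical naming pattern: <collection_id>_clinical
        tables[f"{row[key]}_clinical"].append(row)
    return dict(tables)


# Hypothetical sample rows; case_id is an assumed field name.
clinical_rows = [
    {"collection_id": "cptac_coad", "case_id": "A"},
    {"collection_id": "cptac_gbm", "case_id": "B"},
    {"collection_id": "cptac_coad", "case_id": "C"},
]
per_collection = split_by_collection(clinical_rows)
```

Each resulting table would then get its own entry in table_metadata and column_metadata, with a single collection_id value.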