Closed fedorov closed 2 years ago
The only way to be completely consistent with how all the other clinical data is presented would be to split the cptac_clinical table into separate tables based on the collection_id, i.e. one for cptac_coad, one for cptac_gbm, etc. This would not be a problem technically. But keeping them all together could be an advantage for cross-collection analysis.
We should probably discuss this.
But here's another complication: as far as I can tell, none of the GDC, PDC, or ISB-CGC uses the CPTAC collection ids as reported in TCIA. In these other resources, all the CPTAC patients belong to either CPTAC2 or CPTAC3, and the patients are not subdivided any further. The TCIA collection ids cptac-aml, cptac-brca, ... cptac-ucec don't appear in these resources. I suppose one could infer the collection id from the disease type or primary site. These collection ids are not in the source BigQuery table I am pulling from ISB-CGC.
The cptac table is now split into several per-collection tables, with separate entries in table_metadata and column_metadata. collection_id is no longer an array.
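For illustration only, a split of this kind could be sketched as below. This is not the actual pipeline code: it assumes each row of the combined table already carries a single collection_id value, and the `case_id` field and the sample rows are hypothetical.

```python
from collections import defaultdict


def split_by_collection(rows, key="collection_id"):
    """Group the rows of one combined table into one list per collection."""
    tables = defaultdict(list)
    for row in rows:
        tables[row[key]].append(row)
    return dict(tables)


# Hypothetical rows standing in for the combined cptac_clinical table.
cptac_clinical = [
    {"collection_id": "cptac_coad", "case_id": "case-1"},
    {"collection_id": "cptac_gbm", "case_id": "case-2"},
    {"collection_id": "cptac_coad", "case_id": "case-3"},
]

per_collection = split_by_collection(cptac_clinical)
for collection_id, table_rows in sorted(per_collection.items()):
    print(collection_id, len(table_rows))
```

Cross-collection analysis stays possible after the split by concatenating the per-collection lists again.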
The field `collection_id` in the column-level metadata includes an array for CPTAC collections. I think I can guess why this is the case (there is a single CPTAC table coming from ISB-CGC), but I am not sure this is the right way to encode this information. This will be quite confusing for the user. The schema of this column should be consistent with its schema in the other places where this column is encountered.

To deal with this, we might include a column `Program` along with `collection_id`. Alternatively, we could unnest the multivalued rows of the `collection_id` column and replicate column-level metadata for each collection in the program.
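A minimal sketch of the unnesting option, assuming the column-level metadata is available as a list of dictionaries in which `collection_id` is either a single string or a list of strings. The `table_name` and `column` fields and the sample rows are hypothetical and only illustrate the shape of the data.

```python
def unnest_collection_id(rows):
    """Return one metadata row per collection, so collection_id is always a string."""
    flat = []
    for row in rows:
        ids = row["collection_id"]
        if isinstance(ids, str):
            ids = [ids]  # already single-valued; treat as a one-element list
        for collection_id in ids:
            new_row = dict(row)  # replicate the remaining column-level metadata
            new_row["collection_id"] = collection_id
            flat.append(new_row)
    return flat


# Hypothetical column-level metadata rows.
column_metadata = [
    {"collection_id": ["cptac_coad", "cptac_gbm"],
     "table_name": "cptac_clinical", "column": "age"},
    {"collection_id": "some_other_collection",
     "table_name": "some_other_clinical", "column": "age"},
]

flat_metadata = unnest_collection_id(column_metadata)
for row in flat_metadata:
    print(row["collection_id"], row["table_name"], row["column"])
```

After this step every row has a scalar `collection_id`, which matches the schema used for the non-CPTAC collections, at the cost of repeating the same metadata once per collection.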