[clinical] Add column to point to the dataset/table from the dictionary

ImagingDataCommons / ETL


Apache License 2.0


[clinical] Add column to point to the dataset/table from the dictionary #25

Closed fedorov closed 2 years ago

fedorov commented 2 years ago

Currently, we rely on naming conventions to dereference collection-specific dictionary elements to the specific tables, which works as long as we have everything in the same dataset.

As discussed yesterday, it would be more robust and easier to understand if we had a direct reference to the dataset/table. It would also help us address use cases where we point to external tables (if we need to), for example if we decide to point to CPTAC/TCGA tables in ISB-CGC.

G-White-ISB commented 2 years ago

The table name is now recorded literally in the clinical_meta and clinical_summary tables, i.e., no 'dereferencing' is needed. Project and dataset are now recorded literally in the clinical_meta table, but the project and dataset columns should be moved from clinical_meta to clinical_summary: clinical_summary holds table-level meta information about the clinical tables, while clinical_meta holds column-level meta information about them.
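
For illustration only, here is a minimal sketch of what "no dereferencing" could look like on the consumer side: a table-level metadata row carries the project, dataset, and table name literally, and the fully qualified BigQuery table ID is assembled from them. The key names `project`, `dataset`, and `table_name`, as well as `TableRef` and `table_ref_from_row`, are assumptions made for this sketch, not the actual schema or code of the ETL.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TableRef:
    """Literal reference to a BigQuery table (hypothetical helper type)."""

    project: str
    dataset: str
    table_name: str

    def full_id(self) -> str:
        # BigQuery's standard fully qualified form: project.dataset.table
        return f"{self.project}.{self.dataset}.{self.table_name}"


def table_ref_from_row(row: Mapping[str, str]) -> TableRef:
    """Build a TableRef from one table-level metadata row.

    The key names below are assumed for illustration; the real
    column names in clinical_summary may differ.
    """
    required = ("project", "dataset", "table_name")
    missing = [key for key in required if not row.get(key)]
    if missing:
        raise ValueError(f"metadata row is missing: {', '.join(missing)}")
    return TableRef(row["project"], row["dataset"], row["table_name"])
```

With a row such as `{"project": "p", "dataset": "d", "table_name": "t"}` (placeholder values), `table_ref_from_row(row).full_id()` gives `"p.d.t"`, so no naming convention has to be applied to find the table.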

fedorov commented 2 years ago

I am assuming the table names changed to clinical_meta_table and clinical_meta_column. But it looks like we do not have any collections that rely on tables from other projects integrated into this right now (i.e., neither CPTAC nor TCGA collections are included). Should we go through the steps to integrate at least one project that relies on external tables, to make sure the architecture and organization can support external sources? Did we decide whether we would replicate those external tables under a versioned dataset, or instead include external references?

G-White-ISB commented 2 years ago

I think we have not made a decision with respect to referencing or just duplicating external sources.

fedorov commented 2 years ago

I propose duplicating external sources. Those tables should not be large, and if we do not duplicate them, they can disappear or change at any moment.

G-White-ISB commented 2 years ago

Sure. I expect ISB-CGC is the only other entity pulling relevant data into BigQuery. In addition to CPTAC I know an ISB colleague will be gathering HTAN clinical data into BigQuery.

G-White-ISB commented 2 years ago

We are copying the CPTAC BQ table into per-collection tables in our clinical dataset and recording the source BQ table in the table_metadata table. I suggest we close this issue.

fedorov commented 2 years ago

Just one clarification question. Currently, the source of the CPTAC tables points to current (isb-cgc-bq.CPTAC.clinical_gdc_current). A few months from now, current may well refer to different content. Can you discuss with the ISB-CGC folks whether it makes sense to point to the actual numbered/versioned table instead of current, and note the response here?
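
As a rough sketch of how such a check could be automated, the snippet below flags recorded source tables whose name ends in `_current`. It assumes, based only on the one example in this thread, that a `_current` suffix marks a floating alias; the function names and the second table ID in the usage note are hypothetical.

```python
from typing import Tuple


def split_table_id(table_id: str) -> Tuple[str, str, str]:
    """Split 'project.dataset.table' into its three parts."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.table, got {table_id!r}")
    return parts[0], parts[1], parts[2]


def points_to_current(table_id: str, suffix: str = "_current") -> bool:
    """Return True if the table name looks like a floating 'current' alias.

    The suffix convention is an assumption taken from the example
    isb-cgc-bq.CPTAC.clinical_gdc_current; adjust it if the source
    project names its tables differently.
    """
    _, _, table = split_table_id(table_id)
    return table.endswith(suffix)
```

For example, `points_to_current("isb-cgc-bq.CPTAC.clinical_gdc_current")` is true, while a made-up ID such as `"example-project.example_dataset.clinical_v1"` is not flagged.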

fedorov commented 2 years ago

Superseded by #40