cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Variable documentation for Xena Browser's PANCAN_clinicalMatrix #14

Closed Inquisitive-Geek closed 7 years ago

Inquisitive-Geek commented 7 years ago

Hi @dhimmel,

The documentation links provided for the 3 datasets did not explain the variables involved clearly. It would be great if you could share some links around that.

Thanks, Roshan

dhimmel commented 7 years ago

@Inquisitive-Geek, I believe you're referring to these three downloads and corresponding links:

The links are to the Xena Browser info pages, since Xena is the team that makes the data. These pages don't provide much documentation of each column. I know Xena may have some additional documentation on various help pages.

@gwaygenomics or @jingchunzhu do you know of any documentation of what each variable means in PANCAN_clinicalMatrix?

@Inquisitive-Geek if you have questions about specific columns, then we can provide our best guess. For more authoritative documentation, I recommend messaging the UCSC Xena Browser Google Group. They've been really helpful so far and can likely give the best answers to these questions.

Inquisitive-Geek commented 7 years ago

Please correct me wherever I am wrong, as my knowledge of genomics is nil.

Thanks. Let's start with the clinical matrix dataset. Here's what I understand about the variables whose names start with _GENOMIC_ID_TCGA_PANCAN, e.g. _GENOMIC_ID_TCGA_PANCAN_HumanMethylation27:

They seem to be some sort of flag variable denoting the gene (e.g. HumanMethylation27) present in the sample. If the value is not NaN (it looks like it is the patient ID when it isn't), then the gene is present in the sample.

Also, I did not understand what _RFS, _RFS_UNIT & _RFS_IND mean. It seems like _TIME_TO_EVENT means the time it took for the cell to mutate.

jingchunzhu commented 7 years ago

Forwarded to the Google Group.

Update by @dhimmel: see the Google Group post here https://groups.google.com/forum/#!topic/ucsc-cancer-genomics-browser/Hmj3JTzOz0Q.

jingchunzhu commented 7 years ago

Identifiers:

  - _EVENT: event (in this case, the overall survival event)
  - _INTEGRATION: ID used for integrating data on the Xena Browser and across cohorts
  - _OS: overall survival time
  - _OS_IND: overall survival event
  - _OS_UNIT: overall survival time unit
  - _PANCAN_CNA_PANCAN_K8: 2012 pancan paper publication data
  - _PANCAN_Cluster_Cluster_PANCAN: 2012 pancan paper publication data
  - _PANCAN_DNAMethyl_PANCAN: 2012 pancan paper publication data
  - _PANCAN_RPPA_PANCAN_K8: 2012 pancan paper publication data
  - _PANCAN_UNC_RNAseq_PANCAN_K16: 2012 pancan paper publication data
  - _PANCAN_miRNA_PANCAN: 2012 pancan paper publication data
  - _PANCAN_mutation_PANCAN: 2012 pancan paper publication data
  - _PATIENT: TCGA patient ID
  - _RFS: recurrence-free survival (Xena curated; note: I trust the overall survival data much more)
  - _RFS_IND: recurrence-free survival event
  - _RFS_UNIT: RFS time unit
  - _TIME_TO_EVENT: time to event (in this case, it is exactly like the overall survival event)
  - _TIME_TO_EVENT_UNIT: time unit
  - _cohort: cohort name (also used as cohort ID)
  - _primary_disease: primary disease
  - _primary_site: primary organ of origin
  - age_at_initial_pathologic_diagnosis
  - gender
  - sampleID: sample ID (same as _INTEGRATION)
  - sample_type: sample type
  - sample_type_id

Anything starting with _GENOMIC_ID holds legacy mapping information for the original UUIDs from the TCGA DCC (which has been replaced by the GDC), so I don't think any of these mappings will be useful anymore, at least to the vast majority of people.

Also, note that you can take a look at the dataset detail page at https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?dataset=TCGA.PANCAN.sampleMap/PANCAN_clinicalMatrix&host=https://tcga.xenahubs.net

Then click on the "all identifiers" link to see all the variables available: https://genome-cancer.soe.ucsc.edu/proj/site/xena/datapages/?host=https%3A%2F%2Ftcga.xenahubs.net&dataset=TCGA.PANCAN.sampleMap%2FPANCAN_clinicalMatrix&label=Phenotypes&allIdentifiers=true

Jing

jingchunzhu commented 7 years ago

And to follow up, '_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27' is denoting the genomic sample ID in the Methylations 27K dataset. TCGA gives different IDs for each sample in each dataset: https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data.

For _RFS, _RFS_UNIT, _RFS_IND and _TIME_TO_EVENT, please see this help page: http://xena.ucsc.edu/km-plot-help/. _RFS is 'recurrence free survival'.

Author: Mary Goldman Source : https://groups.google.com/d/msg/ucsc-cancer-genomics-browser/Hmj3JTzOz0Q/9Q2b3QPoAQAJ
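
To make the time/event pairing above concrete, here is a minimal sketch of a Kaplan-Meier estimate of the kind the linked help page plots. It assumes that a time column such as _OS holds follow-up durations and that the matching indicator such as _OS_IND is 1 when the event was observed and 0 when the sample was censored. That coding is an assumption, so check it against the Xena help page. The function name and the toy values are hypothetical and not taken from PANCAN_clinicalMatrix.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with an observed event.

    times  : follow-up durations (e.g. values from _OS)
    events : 1 if the event was observed, 0 if censored
             (assumed meaning of _OS_IND, not confirmed in this thread)
    """
    observed = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # every sample leaves the risk set at its own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk samples that survive time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve

# Hypothetical toy values for illustration only.
os_time = [5, 8, 8, 12, 20]
os_ind = [1, 1, 0, 1, 0]
curve = kaplan_meier(os_time, os_ind)
```

The same function would apply to _RFS and _RFS_IND, with rows that have a missing time or indicator dropped beforehand.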

dhimmel commented 7 years ago

@jingchunzhu / Mary -- are the "Sample IDs" in Xena Browser:

  1. TCGA Barcodes?
  2. TCGA UUIDs?
  3. Xena-specific identifiers?

_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27 is denoting the genomic sample ID in the Methylations 27K dataset. TCGA gives different IDs for each sample in each dataset.

It now makes sense why fields like _PANCAN_mutation_PANCAN are encoded as missing / sample_id rather than binary (0 / 1).
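
If a binary flag is more convenient, a column encoded this way can be collapsed to present / absent. The sketch below is only an illustration: it assumes a tab-separated file in which a missing value is an empty field (other loaders may show it as NaN), and the two rows, the barcodes in them, and the helper name `presence_flags` are hypothetical.

```python
import csv
import io

# Hypothetical two-row excerpt; the real matrix has many more columns and rows.
tsv = (
    "sampleID\t_PANCAN_mutation_PANCAN\n"
    "TCGA-02-0001-01\tTCGA-02-0001-01\n"
    "TCGA-02-0003-01\t\n"
)

def presence_flags(handle, column):
    """Map each sampleID to True when `column` holds an ID, False when it is empty."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sampleID"]: bool(row[column].strip()) for row in reader}

flags = presence_flags(io.StringIO(tsv), "_PANCAN_mutation_PANCAN")
```

Here `flags` records whether each sample has an entry in the column, which is the 0 / 1 reading of the field.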

jingchunzhu commented 7 years ago

are the "Sample IDs" in Xena Browser TCGA Barcodes, TCGA UUIDs, or Xena-specific identifiers?

Sample IDs in Xena Browser are TCGA barcodes, in particular at the sample level (TCGA gives different IDs: https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data). The reason to use this level is to get the best integration of the various genomic data types. If you go with a level below samples, you will have many more entities with missing dimensions, for example mutation data but no expression data. If you go with the patient level, then you will have to handle the primary tumor, recurrent tumor and, mostly, normal samples from the same patient; essentially, you will probably end up throwing out the normal sample data.

_GENOMIC_ID_TCGA_PANCAN_HumanMethylation27 is denoting the genomic sample ID in the Methylations 27K dataset. TCGA gives different IDs for each sample in each dataset.

I can't tell if there is a question about _GENOMIC_ID_TCGA_PANCAN_HumanMethylation27?

Source: https://groups.google.com/d/msg/ucsc-cancer-genomics-browser/Hmj3JTzOz0Q/-vJtHmN4AwAJ
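
As an illustration of the sample-level versus patient-level point above, here is a minimal sketch that splits a sample-level barcode into a patient ID and a sample type code and then groups samples by patient. It assumes the usual TCGA barcode layout (project, tissue source site, participant, sample, separated by hyphens), in which the first two digits of the fourth field give the sample type (01 to 09 for tumor types, 10 to 19 for normal types). The barcodes and function names are hypothetical.

```python
from collections import defaultdict

def split_sample_barcode(barcode):
    """Split a sample-level barcode such as 'TCGA-AB-1234-01' into its parts."""
    project, tss, participant, sample = barcode.split("-")[:4]
    patient_id = "-".join([project, tss, participant])
    sample_type_code = sample[:2]  # ignore a trailing vial letter if present
    return patient_id, sample_type_code

def group_by_patient(barcodes):
    """Collect the sample type codes seen for each patient."""
    grouped = defaultdict(list)
    for barcode in barcodes:
        patient_id, code = split_sample_barcode(barcode)
        grouped[patient_id].append(code)
    return dict(grouped)

# Hypothetical barcodes for illustration only.
samples = ["TCGA-AB-1234-01", "TCGA-AB-1234-11", "TCGA-CD-5678-01"]
by_patient = group_by_patient(samples)
```

A patient with more than one code in `by_patient` is exactly the case described above, where a patient-level analysis has to decide which sample to keep.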

dhimmel commented 7 years ago

I don't think we have any outstanding questions related to variables in PANCAN_clinicalMatrix, so I'm going to close this issue.

For future Xena questions, we can open new issues and mention @jingchunzhu and @maryjgoldman, who are part of the Xena team and have graciously offered their support.