Hi, all.
I'm only emailing this to the three of you since you deal directly with TCGA annotation data (plus, I don't want to raise panic in others). This applies only to the TCGA pancancer atlas and Treehouse annotations, not pancan12.
While using some of the TCGA annotations we wrangled while I was at UCSC to look up information about some samples, I found a few discrepancies. Nothing big, but twice the annotations in the file didn't match the annotations in Xena. That was enough to make me doubt the accuracy of every annotation in the file, and enough to re-download and re-wrangle the core annotation data from Xena. I am not sure exactly when things got off track. The file changed hands several times and different columns were wrangled by different people. It doesn't really matter. Attached is the re-wrangled file I will be using for my purposes, and you are welcome to use it too.
A few notes:
I downloaded phenotype information for each tumor type from the Xena TCGA hub.
From each of those files I pulled the following information:
sampleID
disease
primary_site
age_at_dx
gender
grade
M_stage
N_stage
T_stage
stage
site_details
subtype
breast_carcinoma_estrogen_receptor_status
breast_carcinoma_progesterone_receptor_status
tumor_normal
sample_type
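In case it helps anyone repeat the extraction, here is a minimal sketch of pulling the columns listed above out of one tab-separated phenotype file. It is an illustration, not the script that produced the attached file: the function name `extract_columns`, the `rename` mapping, and the source column name `cohort_disease_name` are all hypothetical, and the real per-cohort column names in Xena will differ.

```python
import csv
import io

# The 16 output columns listed above.
TARGET_COLUMNS = [
    "sampleID", "disease", "primary_site", "age_at_dx", "gender", "grade",
    "M_stage", "N_stage", "T_stage", "stage", "site_details", "subtype",
    "breast_carcinoma_estrogen_receptor_status",
    "breast_carcinoma_progesterone_receptor_status",
    "tumor_normal", "sample_type",
]


def extract_columns(tsv_text, rename=None):
    """Keep only TARGET_COLUMNS from a tab-separated table.

    `rename` maps source column names to target names (hypothetical;
    each cohort would need its own mapping). Columns that are absent
    from the source are filled with an empty string.
    """
    rename = rename or {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        renamed = {rename.get(key, key): value for key, value in record.items()}
        rows.append({col: renamed.get(col, "") for col in TARGET_COLUMNS})
    return rows


# Made-up example input; the sample ID is a placeholder, not a real sample.
example = (
    "sampleID\tcohort_disease_name\tgender\tunused_column\n"
    "TCGA-XX-0000-01\texample disease\tFEMALE\tfoo\n"
)
rows = extract_columns(example, rename={"cohort_disease_name": "disease"})
```

Filling absent columns with an empty string keeps every tumor type on the same 16-column layout, so the per-cohort tables can simply be concatenated.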
Where both pathological and clinical stage were available, I used the pathological stage; otherwise I used the clinical stage.
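The stage rule can be sketched roughly as follows. This is only an illustration under assumptions: the column names `pathologic_stage` and `clinical_stage`, the function name `pick_stage`, and the set of "missing" placeholders are hypothetical and would need to be adjusted to what each phenotype file actually contains.

```python
# Placeholder strings treated as "no value" (an assumption; extend as needed).
MISSING = {"", "na", "n/a", "[not available]", "[unknown]"}


def pick_stage(row, pathologic_col="pathologic_stage", clinical_col="clinical_stage"):
    """Return the pathological stage if present, else the clinical stage, else ""."""
    for col in (pathologic_col, clinical_col):
        value = (row.get(col) or "").strip()
        if value.lower() not in MISSING:
            return value
    return ""


both = pick_stage({"pathologic_stage": "Stage II", "clinical_stage": "Stage I"})
clinical_only = pick_stage({"pathologic_stage": "[Not Available]", "clinical_stage": "Stage I"})
neither = pick_stage({"clinical_stage": ""})
```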
Subtype information took some manual curation; for some cancers it's the histological subtype, for others it's the molecular subtype. For glioblastomas, lower grade gliomas, testicular germ cell tumors, mesothelioma, and sarcoma, the subtypes came from the paper-ready annotation freeze (since I was a part of those working groups). For BRCA, the subtypes are PAM50 calls from the 2012 Nature paper. There are later annotations available from the 2015 Cell paper. You are welcome to update those. I am including the PAM50 files, along with some comparisons between these calls that I produced earlier for something else.
As for grade, some tumor types do not have grade annotations. The only grade I filled in myself was for glioblastoma, which is not annotated in the source but which we know is G4.
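For reference, the glioblastoma grade fill amounts to something like the sketch below. The function name `fill_gbm_grade` and the disease label `"GBM"` are assumptions for illustration; the `disease` column in the attached file may spell glioblastoma differently, so check the label before using this.

```python
def fill_gbm_grade(row, gbm_label="GBM"):
    """Return a copy of row with grade set to G4 for glioblastoma when grade is empty.

    `gbm_label` is an assumed value of the `disease` column.
    Existing grade values and all other tumor types are left untouched.
    """
    out = dict(row)
    grade = (out.get("grade") or "").strip()
    if not grade and out.get("disease") == gbm_label:
        out["grade"] = "G4"
    return out


gbm = fill_gbm_grade({"disease": "GBM", "grade": ""})
other = fill_gbm_grade({"disease": "LGG", "grade": ""})
kept = fill_gbm_grade({"disease": "LGG", "grade": "G2"})
```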
You will need to change the IDs to match how you ID TCGA samples in the Treehouse data now.
Let me know if you have any questions.
Yulia
annotations.tcga.txt BRCA_pam50.cell_2015.txt BRCA_pam50.nature2012.txt
brca_subtype_calls_comparisons_2.pdf