cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Acronyms for diseases #26

Closed by dhimmel 7 years ago

dhimmel commented 8 years ago

In another discussion @gwaygenomics shared acronyms for TCGA diseases as a text file (tcga_dictionary.txt). The contents are:

| tissue | acronym |
| --- | --- |
| adrenocortical cancer | ACC |
| bladder urothelial carcinoma | BLCA |
| breast invasive carcinoma | BRCA |
| cervical & endocervical cancer | CESC |
| cholangiocarcinoma | CHOL |
| colon adenocarcinoma | COAD |
| diffuse large B-cell lymphoma | DLBC |
| esophageal carcinoma | ESCA |
| glioblastoma multiforme | GBM |
| head & neck squamous cell carcinoma | HNSC |
| kidney chromophobe | KICH |
| kidney clear cell carcinoma | KIRC |
| kidney papillary cell carcinoma | KIRP |
| acute myeloid leukemia | LAML |
| brain lower grade glioma | LGG |
| liver hepatocellular carcinoma | LIHC |
| lung adenocarcinoma | LUAD |
| lung squamous cell carcinoma | LUSC |
| mesothelioma | MESO |
| ovarian serous cystadenocarcinoma | OV |
| pancreatic adenocarcinoma | PAAD |
| pheochromocytoma & paraganglioma | PCPG |
| prostate adenocarcinoma | PRAD |
| rectum adenocarcinoma | READ |
| sarcoma | SARC |
| skin cutaneous melanoma | SKCM |
| stomach adenocarcinoma | STAD |
| testicular germ cell tumor | TGCT |
| thyroid carcinoma | THCA |
| thymoma | THYM |
| uterine corpus endometrioid carcinoma | UCEC |
| uterine carcinosarcoma | UCS |
| uveal melanoma | UVM |

dhimmel commented 8 years ago

My question is whether these acronyms are suitable for inclusion in automated workflows. For example, are they standardized across TCGA datasets? Furthermore, if additional diseases get added to Xena Browser data, do we want to create a breaking dependency on manually adding the abbreviation?

Also some brainstorming on areas where the acronyms are more useful than the full names.
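
One way to avoid a breaking dependency would be a lookup that degrades gracefully when a disease has no recorded acronym. The following is only a minimal sketch: the `abbreviate` function and the `ABBREVIATIONS` dictionary are hypothetical names, and the dictionary holds just three entries copied from the list above.

```python
import warnings

# Three entries copied from tcga_dictionary.txt; the full file has 33.
ABBREVIATIONS = {
    "adrenocortical cancer": "ACC",
    "bladder urothelial carcinoma": "BLCA",
    "breast invasive carcinoma": "BRCA",
}


def abbreviate(disease, mapping=ABBREVIATIONS):
    """Return the acronym for a disease, or the full name if none is recorded."""
    try:
        return mapping[disease]
    except KeyError:
        # Warn instead of raising, so a newly added disease does not
        # break the workflow before someone adds its abbreviation.
        warnings.warn(f"no abbreviation recorded for {disease!r}")
        return disease
```

With this approach a new disease would show up under its full name plus a warning, rather than stopping the pipeline.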

gwaybio commented 8 years ago

> are they standardized across TCGA datasets?

Yes, the acronyms are standardized, and more strictly than the disease names.

> if additional diseases get added to Xena Browser data, do we want to create a breaking dependency on manually adding the abbreviation?

In almost every version of the clinical matrix I've seen, the disease is included with the acronym in two separate columns. This is definitely a concern with this iteration of the data, but considering the full data will be made public soon (late October, I think) we should be ok.

> Also some brainstorming on areas where the acronyms are more useful than the full names.

  1. In all disease-specific plots the acronym will be better (takes up less space!)
  2. In the "disease selector" screen, the user can select based on acronym. TCGA disease-types also have designated colors for visualization purposes, we should also adhere to those (e.g. in one of the original pan-cancer studies, which covered only 12 diseases, though I believe colors have been picked for all 33).
dhimmel commented 8 years ago

@gwaygenomics nice --- the standardization makes me more comfortable here. I see many benefits to the abbreviations. For example, covariates.tsv would be much nicer with these space-free and short names.

What about the TCGA Study Abbreviations page from the Genomic Data Commons? Should we use it instead of the TCGA Data Portal, which claims to be deprecated?

> TCGA disease-types also have designated colors for visualization purposes, we should also adhere to those

Agree. I think we will want a file called diseases.tsv in this repository with columns for name, abbreviation, and color.
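
As a rough illustration of what such a file could look like, the sketch below writes and reads a three-column TSV with Python's standard `csv` module. The helper names `write_diseases_tsv` and `read_abbreviations` are hypothetical, the rows are a small sample from the acronym list above, and the `color` column is left empty because no color values are given in this thread.

```python
import csv
import io

COLUMNS = ["name", "abbreviation", "color"]

# Sample rows only; colors are intentionally blank (not specified here).
rows = [
    {"name": "adrenocortical cancer", "abbreviation": "ACC", "color": ""},
    {"name": "bladder urothelial carcinoma", "abbreviation": "BLCA", "color": ""},
    {"name": "breast invasive carcinoma", "abbreviation": "BRCA", "color": ""},
]


def write_diseases_tsv(rows, handle):
    """Write disease rows as tab-separated text with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


def read_abbreviations(handle):
    """Return a {name: abbreviation} lookup from a diseases TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["name"]: row["abbreviation"] for row in reader}


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_diseases_tsv(rows, buffer)
buffer.seek(0)
lookup = read_abbreviations(buffer)
```

Replacing the in-memory buffer with `open("diseases.tsv", "w", newline="")` would write the file to disk.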

As far as nomenclature goes, should we use abbreviation over acronym -- as that's what TCGA seems to use?

gwaybio commented 8 years ago

Yes, let's use the GDC.

> As far as nomenclature goes, should we use abbreviation over acronym -- as that's what TCGA seems to use?

Doesn't seem to be consistent anywhere I look. I've seen disease, tissue, cohort, acronym, and now abbreviation. I do not have a preference!

dhimmel commented 8 years ago

I compared the disease names in our diseases.tsv at 54140cf6addc48260c9723213c40b628d7c861da to the GDC listing. Since GDC seems to use sentence case whereas Xena Browser uses all lowercase, I converted GDC names to lowercase and looked for Xena diseases without a match. The following table shows my manual mapping of the diseases which didn't match:

| Xena Browser Disease Name | GDC Study Name | GDC Study Abbreviation |
| --- | --- | --- |
| adrenocortical cancer | Adrenocortical carcinoma | ACC |
| cervical & endocervical cancer | Cervical squamous cell carcinoma and endocervical adenocarcinoma | CESC |
| diffuse large B-cell lymphoma | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | DLBC |
| head & neck squamous cell carcinoma | Head and Neck squamous cell carcinoma | HNSC |
| kidney clear cell carcinoma | Kidney renal clear cell carcinoma | KIRC |
| kidney papillary cell carcinoma | Kidney renal papillary cell carcinoma | KIRP |
| pheochromocytoma & paraganglioma | Pheochromocytoma and Paraganglioma | PCPG |
| testicular germ cell tumor | Testicular Germ Cell Tumors | TGCT |
| uterine corpus endometrioid carcinoma | Uterine Corpus Endometrial Carcinoma | UCEC |
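
The comparison described above can be sketched roughly as follows. This is not the code that was actually run: the function name `unmatched_xena_names` is hypothetical, the inputs are a four-disease sample, and the casing of the two GDC names that do match is illustrative. Unlike the description, the sketch lowercases both sides, since a Xena name such as "diffuse large B-cell lymphoma" contains a capital letter.

```python
# Sample of Xena Browser disease names (all from the acronym list).
xena_names = [
    "adrenocortical cancer",
    "bladder urothelial carcinoma",
    "kidney clear cell carcinoma",
    "thymoma",
]

# Sample of GDC study names mapped to their abbreviations.
# Casing of the names that match Xena is illustrative.
gdc_studies = {
    "Adrenocortical carcinoma": "ACC",
    "Bladder Urothelial Carcinoma": "BLCA",
    "Kidney renal clear cell carcinoma": "KIRC",
    "Thymoma": "THYM",
}


def unmatched_xena_names(xena_names, gdc_studies):
    """Return Xena names with no case-insensitive match among GDC names."""
    gdc_lower = {name.lower() for name in gdc_studies}
    return [name for name in xena_names if name.lower() not in gdc_lower]


unmatched = unmatched_xena_names(xena_names, gdc_studies)
```

For this sample, `unmatched` holds "adrenocortical cancer" and "kidney clear cell carcinoma", the two sample names that also appear in the manual mapping table.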

Alerting @jingchunzhu and @maryjgoldman that the Xena disease names have diverged from the GDC names.

So I have a few thoughts/questions:

@gwaygenomics, you've convinced me that these abbreviations are important enough that we should add them to our workflow. Hopefully, we can find a solution on the upstream/automated side, but I'm willing to settle for a manual solution as a fallback.

maryjgoldman commented 8 years ago

Yes, we haven't started pulling data from the GDC yet (we're still using data from cgHub), so we haven't pulled in their names.

I would check with the GDC for a mapping from disease to abbreviation, since they are the ones providing both points of data.

dhimmel commented 8 years ago

> I would check with the GDC for a mapping from disease to abbreviation, since they are the ones providing both points of data.

Hmm. The GDC mapping we found uses different disease names than Xena. @maryjgoldman, you're saying GDC would know how to resolve the differences?

I Googled for some of the disease names from Xena and their abbreviations. There were really only three hits. Conveniently, the two GitHub hits are from @gwaygenomics and @jingchunzhu.

@jingchunzhu do you have any comments on cancerGroupTitle in TCGAUtil.py and whether this would be the right mapping for us?

gwaybio commented 8 years ago

@jingchunzhu - also (slightly unrelated), would you know where to find an updated code tables report for TCGA barcodes? The code tables report page here is deprecated.