dhimmel closed this issue 7 years ago
My question is whether these acronyms are suitable for inclusion in automated workflows. For example, are they standardized across TCGA datasets? Furthermore, if additional diseases get added to Xena Browser data, do we want to create a breaking dependency on manually adding the abbreviation?
Also some brainstorming on areas where the acronyms are more useful than the full names.
> are they standardized across TCGA datasets?

Yes, the acronyms are standardized, and more strictly than the disease names.

> if additional diseases get added to Xena Browser data, do we want to create a breaking dependency on manually adding the abbreviation?
In almost every version of the clinical matrix I've seen, the disease is included with the acronym in two separate columns. This is definitely a concern with this iteration of the data, but considering the full data will be made public soon (late October, I think) we should be ok.
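If the clinical matrix does carry the disease and its acronym in two columns, the mapping could be derived automatically instead of maintained by hand. The following is only a sketch under stated assumptions: the column names `disease` and `acronym`, the sample IDs, and the helper `derive_abbreviations` are hypothetical and not taken from the actual Xena files.

```python
import csv
import io

# Hypothetical excerpt of a clinical matrix. The column names "disease" and
# "acronym" and the sample IDs are assumptions, not the real Xena headers.
CLINICAL_TSV = (
    "sample_id\tdisease\tacronym\n"
    "S1\tadrenocortical cancer\tACC\n"
    "S2\tkidney clear cell carcinoma\tKIRC\n"
    "S3\tadrenocortical cancer\tACC\n"
)


def derive_abbreviations(handle, name_col="disease", abbrev_col="acronym"):
    """Build a disease -> abbreviation dict, failing on inconsistent pairs."""
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name, abbrev = row[name_col], row[abbrev_col]
        # setdefault stores the first abbreviation seen for each disease.
        if mapping.setdefault(name, abbrev) != abbrev:
            raise ValueError(
                f"{name!r} maps to both {mapping[name]!r} and {abbrev!r}"
            )
    return mapping


mapping = derive_abbreviations(io.StringIO(CLINICAL_TSV))
print(mapping)
```

With this approach a newly added disease would be picked up without a manual step, and a disease paired with two different acronyms would fail loudly rather than silently.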
> Also some brainstorming on areas where the acronyms are more useful than the full names.
- In all disease-specific plots the acronym will be better (takes up less space!)
- In the "disease selector" screen, the user can select based on acronym. TCGA disease types also have designated colors for visualization purposes, and we should adhere to those (e.g. as in one of the original pan-cancer studies, which covered only 12 diseases, though I believe colors have been picked for all 33).
@gwaygenomics nice --- the standardization makes me more comfortable here. I see many benefits to the abbreviations. For example, `covariates.tsv` would be much nicer with these space-free and short names.

What about the TCGA Study Abbreviations page from the Genomic Data Commons? Should we use it instead of the TCGA Data Portal, which claims to be deprecated?
> TCGA disease-types also have designated colors for visualization purposes, we should also adhere to those

Agree. I think we will want a file called `diseases.tsv` in this repository with columns for `name`, `abbreviation`, and `color`.
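As a rough illustration of how such a file might be loaded and checked, here is a minimal sketch. The two names and abbreviations are pairs that appear in this thread, but the color values are placeholders rather than TCGA's designated colors, and the helper `load_diseases` is hypothetical.

```python
import csv
import io
import re

# Illustrative diseases.tsv content. The colors are placeholders,
# NOT the designated TCGA colors.
DISEASES_TSV = (
    "name\tabbreviation\tcolor\n"
    "adrenocortical cancer\tACC\t#000000\n"
    "testicular germ cell tumor\tTGCT\t#000000\n"
)


def load_diseases(handle):
    """Return a name -> row dict after basic sanity checks."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    abbreviations = [row["abbreviation"] for row in rows]
    # Abbreviations are meant to be usable as identifiers, so require uniqueness.
    if len(set(abbreviations)) != len(abbreviations):
        raise ValueError("abbreviations must be unique")
    for row in rows:
        # Accept only six-digit hex colors such as #1A2B3C.
        if not re.fullmatch(r"#[0-9A-Fa-f]{6}", row["color"]):
            raise ValueError(
                f"bad color for {row['abbreviation']}: {row['color']!r}"
            )
    return {row["name"]: row for row in rows}


diseases = load_diseases(io.StringIO(DISEASES_TSV))
print(diseases["adrenocortical cancer"]["abbreviation"])
```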
As far as nomenclature goes, should we use abbreviation over acronym -- as that's what TCGA seems to use?
Yes, let's use the GDC.

> As far as nomenclature goes, should we use abbreviation over acronym -- as that's what TCGA seems to use?
Doesn't seem to be consistent anywhere I look. I've seen `disease`, `tissue`, `cohort`, `acronym`, and now `abbreviation`. I do not have a preference!
I compared the disease names in our `diseases.tsv` at 54140cf6addc48260c9723213c40b628d7c861da to the GDC listing. Since GDC seems to use sentence case whereas Xena Browser uses all lowercase, I converted GDC names to lowercase and looked for Xena diseases without a match. The following table shows my manual mapping of the diseases which didn't match:
Xena Browser Disease Name | GDC Study Name | GDC Study Abbreviation |
---|---|---|
adrenocortical cancer | Adrenocortical carcinoma | ACC |
cervical & endocervical cancer | Cervical squamous cell carcinoma and endocervical adenocarcinoma | CESC |
diffuse large B-cell lymphoma | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | DLBC |
head & neck squamous cell carcinoma | Head and Neck squamous cell carcinoma | HNSC |
kidney clear cell carcinoma | Kidney renal clear cell carcinoma | KIRC |
kidney papillary cell carcinoma | Kidney renal papillary cell carcinoma | KIRP |
pheochromocytoma & paraganglioma | Pheochromocytoma and Paraganglioma | PCPG |
testicular germ cell tumor | Testicular Germ Cell Tumors | TGCT |
uterine corpus endometrioid carcinoma | Uterine Corpus Endometrial Carcinoma | UCEC |
Alerting @jingchunzhu and @maryjgoldman that the Xena disease names have diverged from the GDC names.
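The comparison described above (lowercase the GDC names, then look for Xena diseases without a match) can be sketched roughly as follows. The first two pairs come from the table; the bladder pair is included as an assumed matching case for illustration, and the function name `unmatched` is hypothetical.

```python
# Small illustrative subsets; the real comparison used the full listings.
xena_names = [
    "adrenocortical cancer",
    "kidney clear cell carcinoma",
    "bladder urothelial carcinoma",  # assumed to match once case is ignored
]
gdc_names = [
    "Adrenocortical carcinoma",
    "Kidney renal clear cell carcinoma",
    "Bladder Urothelial Carcinoma",
]


def unmatched(xena, gdc):
    """Return Xena names with no case-insensitive exact match in GDC."""
    gdc_lower = {name.lower() for name in gdc}
    return sorted(name for name in xena if name.lower() not in gdc_lower)


missing = unmatched(xena_names, gdc_names)
print(missing)
# → ['adrenocortical cancer', 'kidney clear cell carcinoma']
```

Exact matching after lowercasing only catches differences in case, which is why the remaining names still needed a manual mapping.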
So I have a few thoughts/questions:
@gwaygenomics, you've convinced me that these abbreviations are important enough that we should add them to our workflow. Hopefully, we can find a solution on the upstream/automated side, but I'm willing to settle for a manual solution as a fallback.
Yes, we haven't started pulling data from the GDC yet (we're still using data from cgHub), so we haven't pulled in their names.
I would check with the GDC for a mapping from disease to abbreviation, since they are the ones providing both points of data.
> I would check with the GDC for a mapping from disease to abbreviation, since they are the ones providing both points of data.
Hmm. The GDC mapping we found uses different disease names than Xena. @maryjgoldman, you're saying GDC would know how to resolve the differences?
I Googled for some of the disease names from Xena and their abbreviations. There were really only three hits. Conveniently, the two GitHub hits are from @gwaygenomics and @jingchunzhu:
- `tcga_dictionary.tsv`
- `TCGAUtil.py`. It looks like the `cancerGroupTitle` dictionary could include the mapping we need.

@jingchunzhu do you have any comments on `cancerGroupTitle` in `TCGAUtil.py` and whether this would be the right mapping for us?
@jingchunzhu - also (slightly unrelated), would you know where an updated code tables report is for TCGA barcodes? The code tables report page here is deprecated.
In another discussion @gwaygenomics shared acronyms for TCGA diseases as a text file (`tcga_dictionary.txt`). The contents are: