Process the clinical matrix to extract sample attributes

ypar commented 8 years ago

An issue was raised in today's meeting.

The clinical matrix should be carefully analyzed to select a specific covariate or a set of covariates we can use for analyses.

The relevant notebook for the data download is the TCGA notebook, and the dataset is named PANCAN-clinicalMatrix.

dhimmel commented 8 years ago

We would like to extract sample information for two purposes:

Enabling sample selection by frontend users (see https://github.com/cognoma/cancer-data/issues/13)
Covariates to prevent confounding of our classifiers (see https://github.com/cognoma/machine-learning/issues/21)

gwaybio commented 8 years ago

Enabling sample selection by frontend users (see #13)

To begin building a sample selector I don't think we need more info beyond mapping sample ID to tissue. Mappings are in the clinical matrix. Also, here is a text file holding tissue and TCGA acronym info: tcga_dictionary.txt.

The more I think about it, the more I like the idea of scrapping the sample selector altogether. In this scenario the gene mutation selector (aka status selector) communicates with a backend process that curates the tissues that have enough mutations in the specified gene list (I have been using tissues with >= 10 mutations for inclusion). Then, the X matrix is subset to only those sample IDs belonging to tissues that have enough mutation positives. We can then report classifier performance stratified by tissue.
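A minimal sketch of that tissue curation step, assuming a 0/1 status Series and a tissue label Series that are both indexed by sample ID. The function name and the toy data are hypothetical; only the threshold of 10 comes from the comment above.

```python
import pandas as pd


def select_samples_by_tissue(y, tissue, min_positives=10):
    """Return sample IDs from tissues with at least `min_positives` mutated samples.

    y: Series of 0/1 mutation status indexed by sample ID (assumed layout).
    tissue: Series of tissue/disease acronyms indexed by sample ID (assumed layout).
    """
    # Align tissue labels to the samples that have a status.
    tissue = tissue.reindex(y.index)
    # Count mutation-positive samples within each tissue.
    positives_per_tissue = y.groupby(tissue).sum()
    kept_tissues = positives_per_tissue[positives_per_tissue >= min_positives].index
    # Keep every sample (positive or negative) from the retained tissues.
    return y.index[tissue.isin(kept_tissues)]


# Toy example with a lowered threshold so the effect is visible.
y = pd.Series([1, 1, 0, 1, 0, 0], index=list("abcdef"))
tissue = pd.Series(["LUAD", "LUAD", "LUAD", "LUSC", "LUSC", "LUSC"], index=list("abcdef"))
kept = select_samples_by_tissue(y, tissue, min_positives=2)
print(list(kept))
```

The expression matrix could then be subset with something like `X.loc[kept]`, provided `X` is indexed by the same sample IDs.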

I think having a service that describes the mutations across tissues/genders/age/etc. would be great, but we have to be careful not to reinvent the wheel here since many other services already do this. See COSMIC, NCI GDC, Broad Firehose, or cBioPortal.

dhimmel commented 8 years ago

@gwaygenomics, provenance of tcga_dictionary.txt?

gwaybio commented 8 years ago

@dhimmel my keyboard!

ypar commented 8 years ago

A few questions.

re: tissue dictionary

The attached tcga_dictionary.txt seems to be a dictionary of primary diseases, not tissues. Also, I noticed that the primary sites and sample collection include both normal and tumor tissues. Are these treated indiscriminately as far as feature selection goes?

re: mutations

We should also consider what we want to convey with a mutation selector. Are we providing some sort of risk score? Are we simply counting? Is it used just as a QC threshold? Would these be compared to known databases such as ExAC, ClinVar, or HGMD?

re: covariates

Do we have plans for how to handle missing data? The ClinicalMatrix has less missing data than most other clinical data sheets I've seen, but the amount is not trivial. Are we considering specific variables as covariates, or meta-variables selected by, e.g., PCA? Either way, if we are considering including such covariates in the analysis (i.e. beyond their use for sample selection), I think it should be explicitly stated. E.g., we can exclude all samples without sex in the dict if we deem sex to be a crucial confounder in our analysis.
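One way to make the missing-data question concrete is to measure missingness per candidate covariate before deciding which samples to exclude. The sketch below assumes the clinical matrix is loaded as a pandas DataFrame indexed by sample ID; the column names `sex` and `age` and both function names are hypothetical stand-ins, not the actual PANCAN-clinicalMatrix headers.

```python
import pandas as pd


def summarize_missing(clinical, columns):
    """Fraction of samples missing each candidate covariate, worst first."""
    return clinical[columns].isna().mean().sort_values(ascending=False)


def drop_missing(clinical, required):
    """Keep only samples with a value for every required covariate."""
    return clinical.dropna(subset=required)


# Toy clinical table; real column names will differ.
clinical = pd.DataFrame(
    {"sex": ["F", None, "M"], "age": [60.0, 55.0, None]},
    index=["s1", "s2", "s3"],
)
print(summarize_missing(clinical, ["sex", "age"]))
print(drop_missing(clinical, ["sex"]).index.tolist())
```

Stating the `required` list explicitly is one way to document which covariates are treated as crucial confounders.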

gwaybio commented 8 years ago

@ypar thanks for these questions!

re: tissue dictionary

The TCGA acronym is how they identify "tissue source site" but you're right, they're not strictly "tissues" and "diseases" would be more appropriate. E.g. LUAD is "lung adenocarcinoma" and LUSC is "lung squamous cell carcinoma". However, TCGA has adopted this broad terminology and, to stay consistent, so will we. Your point about tumor vs. normal is definitely something we should consider in the final model. We'll need to filter out "normal", which is really "adjacent normal": normal tissue from the same individual, taken in close proximity to the tumor during the actual debulking surgery. We will also probably want to filter out "metastasis" and patients measured twice. Much of this sample curation is performed before the data is made public - but a lot is left in intentionally, or sneaks past the filters. We can use a combination of the representative columns and official TCGA Barcodes to create an official sample list. For unsupervised feature construction however, I think it is important to leave all the samples in!
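The barcode-based part of that filtering could be sketched as follows. It assumes sample IDs follow the TCGA barcode layout, where the first three hyphen-separated fields identify the patient and the first two digits of the fourth field give the sample type (01 primary solid tumor, 06 metastatic, 11 solid tissue normal). The function names are hypothetical, and this is not the project's actual sample list logic.

```python
# Sample-type codes taken from the first two digits of the barcode's fourth field.
PRIMARY_TUMOR = "01"
METASTATIC = "06"
SOLID_TISSUE_NORMAL = "11"


def parse_barcode(barcode):
    """Split a TCGA barcode into (patient_id, sample_type_code)."""
    fields = barcode.split("-")
    if len(fields) < 4 or fields[0] != "TCGA":
        raise ValueError(f"unexpected barcode: {barcode!r}")
    patient_id = "-".join(fields[:3])
    sample_type = fields[3][:2]  # ignore a trailing vial letter such as "01A"
    return patient_id, sample_type


def primary_tumor_samples(barcodes):
    """Drop adjacent-normal, metastatic, and any other non-primary samples."""
    return [b for b in barcodes if parse_barcode(b)[1] == PRIMARY_TUMOR]


barcodes = ["TCGA-02-0001-01", "TCGA-02-0001-11", "TCGA-D3-A1Q1-06"]
print(primary_tumor_samples(barcodes))
```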

re: mutations

Right now the mutation selector is as follows: a user selects a gene or genes, and cognoma builds a Y matrix of 1's and 0's corresponding to samples in the expression matrix (X), indicating presence or absence of mutation. I think this is the minimal case and we should focus on getting it implemented before we try to get fancy. How we define impactful mutations is another story. We will use the official mutation calls from the mutation matrix to determine if the sample has a mutation in a given input gene. Currently, the plan is to filter only silent mutations but we could also have a stronger threshold which would reference a database. I think referencing a database would strongly benefit certain genes (like oncogenes, where there are known activating mutations) but also limit the power for other genes (like tumor suppressor genes, where there are several known and unknown inactivating mutations along the gene body).
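A minimal sketch of building Y for a gene list, assuming the mutation matrix is a samples-by-genes DataFrame of 0/1 calls in which any filtering of mutation classes has already been applied upstream. The function name and the toy data are hypothetical.

```python
import pandas as pd


def build_status(mutation, expression_index, genes):
    """Return a 0/1 Series: 1 if a sample is mutated in any of `genes`.

    mutation: samples x genes DataFrame of 0/1 calls (assumed layout).
    expression_index: sample IDs of the expression matrix X.
    """
    missing = [g for g in genes if g not in mutation.columns]
    if missing:
        raise KeyError(f"genes not in mutation matrix: {missing}")
    # Only samples present in both matrices can be used for training.
    shared = expression_index.intersection(mutation.index)
    return mutation.loc[shared, genes].any(axis=1).astype(int)


mutation = pd.DataFrame(
    {"TP53": [1, 0, 0], "KRAS": [0, 0, 1]},
    index=["s1", "s2", "s3"],
)
y = build_status(mutation, pd.Index(["s1", "s2", "s4"]), ["TP53", "KRAS"])
print(y.to_dict())
```

Here a sample counts as positive when any selected gene is mutated; other definitions (for example, requiring all genes) would change the `any` call.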

re: covariates

I am not sure how to handle covariates at the moment... I think some sort of adjustment should be discussed, but I don't know the optimal solution. Right now I'm thinking it would be best to include the performance of the model across different covariates in the results viewer.
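Reporting performance across a covariate could look roughly like the sketch below. AUROC is used here as an illustrative metric and is computed with the rank-sum (Mann-Whitney U) identity so the example needs only pandas. The column names `status`, `score`, and `sex`, and both function names, are assumptions for illustration.

```python
import pandas as pd


def auroc(y_true, score):
    """AUROC via the rank-sum identity; NaN when only one class is present."""
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    if n_pos == 0 or n_neg == 0:
        return float("nan")
    ranks = score.rank(method="average")  # ties share their average rank
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def auroc_by_group(results, group_col):
    """Compute AUROC separately within each level of a covariate."""
    rows = {}
    for level, part in results.groupby(group_col):
        rows[level] = auroc(part["status"], part["score"])
    return pd.Series(rows, name="auroc")


results = pd.DataFrame(
    {
        "status": [0, 1, 1, 0],
        "score": [0.2, 0.8, 0.3, 0.7],
        "sex": ["F", "F", "M", "M"],
    }
)
print(auroc_by_group(results, "sex"))
```

A large gap between groups would suggest the classifier is picking up the covariate rather than the mutation signal, which is the confounding concern raised above.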

dhimmel commented 8 years ago

For unsupervised feature construction however, I think it is important to leave all the samples in!

@gwaygenomics in the case of multiple samples per individual are you sure we want to leave those in? I think some unsupervised approaches will assume independent observations.

Currently, the plan is to filter only silent mutations but we could also have a stronger threshold which would reference a database.

That's not our current implementation. We ignore all code orange and code green mutations, based on a classification system developed by the Xena Browser team. See https://github.com/cognoma/cancer-data/issues/2#issuecomment-233088248 for more information and a table of mutation counts by classification.

gwaybio commented 8 years ago

I think some unsupervised approaches will assume independent observations.

Good point. Yeah, we should remove those
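A minimal sketch of reducing the data to one sample per individual, assuming the patient is identified by the first three hyphen-separated fields of the sample ID. Keeping the lexicographically smallest sample ID per patient is an arbitrary choice made for illustration, not a rule from this thread, and the function name is hypothetical.

```python
def one_sample_per_patient(sample_ids):
    """Keep a single sample ID per patient (the lexicographically smallest)."""
    kept = {}
    for sample_id in sorted(sample_ids):
        patient_id = "-".join(sample_id.split("-")[:3])
        # setdefault keeps the first sample seen for each patient.
        kept.setdefault(patient_id, sample_id)
    return sorted(kept.values())


samples = ["TCGA-02-0001-02", "TCGA-02-0001-01", "TCGA-06-0125-01"]
print(one_sample_per_patient(samples))
```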

ypar commented 8 years ago

I think some unsupervised approaches will assume independent observations.

Good point. Yeah, we should remove those

IMO, it is particularly important for unsupervised methods to have the cleanest possible data, although one could argue that it is equally important for supervised methods.
E.g., if confounders and missing values are not treated properly, the first cluster will merely pick out precisely that information and that information only.

dhimmel commented 8 years ago

Discussion on this issue has become off topic. So if we want to keep discussing issues that are not related to processing the PANCAN-clinicalMatrix dataset to extract sample information, let's make new discussions or find an existing discussion that is topical.

cognoma / cancer-data

Process the clinical matrix to extract sample attributes #10