AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Proposed Analysis: Sex prediction on PBTA cohort #73

Closed jharenza closed 5 years ago

jharenza commented 5 years ago

Scientific goals

What are the scientific goals of the analysis? Accurately predict sex of PBTA samples for downstream analyses

Proposed methods

What methods do you plan to use to accomplish the scientific goals?

- Ratio of Y:X+Y chromosomes
- XIST expression in RNA
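A minimal sketch of the first method, assuming per-sample counts of reads mapped to chrX and chrY are already available. The function names and the 0.05 cutoff are hypothetical placeholders, not values from this analysis; a real cutoff would need to be chosen from the observed distribution.

```python
def y_fraction(x_reads: int, y_reads: int) -> float:
    """Return Y / (X + Y) for reads mapped to the sex chromosomes."""
    total = x_reads + y_reads
    if total == 0:
        raise ValueError("no reads mapped to chrX or chrY")
    return y_reads / total


def call_sex(x_reads: int, y_reads: int, threshold: float = 0.05) -> str:
    """Call 'Male' when the Y fraction exceeds the cutoff.

    The default threshold is a hypothetical placeholder, not a validated value.
    """
    return "Male" if y_fraction(x_reads, y_reads) > threshold else "Female"
```

For example, `call_sex(900, 100)` uses a Y fraction of 100 / 1000 and compares it against the cutoff.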

Required input data

What input data will you use for this analysis?

- normal DNA BAMs
- RNA-Seq FPKM

Proposed timeline

What is the timeline for the analysis? One week

Relevant literature

If there is relevant scientific literature, put links to those items here.

jharenza commented 5 years ago

@kgaonkar6, @jharenza, and @shrivatsk are working on this.

cgreene commented 5 years ago

For models of sufficient dimensionality, we see these features coming up regularly: https://www.biorxiv.org/content/10.1101/573782v1 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5728678/

Elastic net logistic regression over gene expression should produce a good classifier from RNA. It would be good to make sure to use a hold-one-histology-out type evaluation, just in case there are male/female distribution differences by histology.
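To make the hold-one-histology-out idea concrete, here is a small sketch of the splitting step only, written without third-party dependencies. The function name and the histology labels in the example are hypothetical; the classifier itself is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def hold_one_histology_out(
    histologies: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_histology, train_indices, test_indices) per histology."""
    # Group sample indices by their histology label.
    by_histology: Dict[str, List[int]] = defaultdict(list)
    for index, histology in enumerate(histologies):
        by_histology[histology].append(index)
    # Each histology takes one turn as the test set; all others are training.
    for held_out, test_indices in sorted(by_histology.items()):
        train_indices = [
            i for i, h in enumerate(histologies) if h != held_out
        ]
        yield held_out, train_indices, test_indices


# Hypothetical labels, one per sample, for illustration only.
labels = ["LGG", "HGG", "LGG", "Ependymoma", "HGG"]
splits = list(hold_one_histology_out(labels))
```

In practice, scikit-learn's `LeaveOneGroupOut` produces the same kind of split, and `LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=...)` is one way to fit the elastic net model on each training fold. Evaluating per held-out histology shows whether the classifier depends on histology-specific signal rather than sex-linked expression.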

jharenza commented 5 years ago

@cgreene and @jaclyn-taroni - for the DNA sex prediction analysis we performed, we used germline BAM files, which do not appear to be included in the open CAVATICA projects. For that specific analysis, do you think we should add the code to the D3b GitHub and to the methods section, and leave this issue open for an RNA analysis only?

cgreene commented 5 years ago

I don't fully grasp the goals of this analysis yet. Is the idea to check the concordance with the sex column in pbta_histologies.csv? If so, then I would consider whether or not it's part of the clinical data harmonization task: https://alexslemonade.github.io/OpenPBTA-manuscript/#clinical-data-harmonization

If so, I think you could convert sex to recorded_sex and add a germline_sex_estimate column to that file. Opening an issue for building a classifier for sex from the RNA expression data could also be helpful. It would be a good QC of our RNA-expression data, and also it should be quite accurate - probably enough that reporting concordance with both of the above could be interesting enough for a brief mention in the paper. Imagine folks who have a dataset with just RNA-expression data but who want to look at concordance with metadata: does RNA-based prediction agree at a similar level as germline-based? Finally, it also seems like a good_first_issue since all of the required bits except these germline-based predictions are already present, and it sounds like those will make it into release V3.
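A rough sketch of the concordance check, assuming the histologies table has been loaded as a list of row dictionaries with the `recorded_sex` and `germline_sex_estimate` columns suggested above. The function name and the example rows are hypothetical; rows missing either value are skipped rather than counted as mismatches.

```python
from typing import Dict, List


def concordance(
    rows: List[Dict[str, str]],
    recorded: str = "recorded_sex",
    estimated: str = "germline_sex_estimate",
) -> float:
    """Return the fraction of rows where the two sex columns agree."""
    # Only compare rows that have a non-empty value in both columns.
    compared = [r for r in rows if r.get(recorded) and r.get(estimated)]
    if not compared:
        raise ValueError("no rows with both values present")
    matches = sum(r[recorded] == r[estimated] for r in compared)
    return matches / len(compared)


# Hypothetical rows for illustration only; not real PBTA data.
example_rows = [
    {"recorded_sex": "Male", "germline_sex_estimate": "Male"},
    {"recorded_sex": "Female", "germline_sex_estimate": "Female"},
    {"recorded_sex": "Female", "germline_sex_estimate": "Male"},
    {"recorded_sex": "Male", "germline_sex_estimate": "Male"},
    {"recorded_sex": "Male", "germline_sex_estimate": ""},
]
rate = concordance(example_rows)
```

The same function could be pointed at an RNA-based prediction column by changing the `estimated` argument, which would allow the two concordance rates to be compared side by side.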

jharenza commented 5 years ago

The back story is that we determined sex to ensure our CNV benchmarking calls use the correct panel of normals (PON; male panel for male patients only, etc.). Currently, CBTTC samples were paired T/N for CNVkit calls and PNOC samples were run using a mixed PON. The reason I really want to get those CNV calls on the sex chromosomes right is ATRX deletions, since ATRX is on chrX.

We found 11 samples non-concordant with reported gender and one which we deemed contaminated (this one did not make it into PBTA due to a T/N mismatch QC, but was in the run). We did this on RNA as well, but this method is clearly not the way to go for RNA.

Attachment: 2019-08-21-bfx scientific meeting-shrivats.pptx

I like the idea of adding that to clinical data harmonization methods - we will do that, and I can change the headers of that V3 clinical file as you suggest. Thanks!

cgreene commented 5 years ago

The RNA-seq analysis there looks like you're just looking at the fraction mapping to Y. I think that's different than the analysis suggested in https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/73#issuecomment-525274972, which should be more sensitive and specific. I still think the analysis proposed above could be helpful to compare. Some folks may have RNA-seq only cohorts, so knowing how a classifier built there would compare with germline could still be useful.

Is it also interesting that you're getting those mappings to the Y chromosome for females? I'm guessing these are mapping to the pseudoautosomal regions or potentially repeats. Would this be helpful for assessing alignment/calling approaches? I imagine this must already have been done by someone.

jharenza commented 5 years ago

Agreed - we haven't explored classifiers yet, but I think we should! Yes - I think GATK has an input parameter dealing with sex that we currently don't use, but that could improve calls on sex chromosomes.

cgreene commented 5 years ago

I'd propose as a path forward:

Does this sound good? I have lots of meetings today, but I'll try to get the new issue filed at some point if you agree with this strategy.

jharenza commented 5 years ago

Agreed!

cgreene commented 5 years ago

Ok - I filed #84! I think the next step is to close this when the changes to histology land.

jharenza commented 5 years ago

#88 V3 histologies list updated, so closing this