Closed jharenza closed 5 years ago
@kgaonkar6, @jharenza, @shrivatsk working on
For models of sufficient dimensionality, we see these features coming up regularly: https://www.biorxiv.org/content/10.1101/573782v1 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5728678/
Elastic net logistic regression over gene expression should produce a good classifier from RNA. It would be good to make sure to use a hold-one-histology-out type evaluation, just in case there are male/female distribution differences by histology.
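To make the hold-one-histology-out idea concrete, here is a minimal sketch of how the folds could be built. It assumes one histology label per sample. The function name `hold_one_histology_out` and the example labels are hypothetical, and the classifier itself (for example an elastic net logistic regression from scikit-learn) would be fit on `train_idx` and scored on `test_idx` inside the loop.

```python
from collections import defaultdict


def hold_one_histology_out(histologies):
    """Yield (held_out, train_idx, test_idx), holding out one histology per fold."""
    by_histology = defaultdict(list)
    for idx, hist in enumerate(histologies):
        by_histology[hist].append(idx)

    for held_out in sorted(by_histology):
        test_idx = by_histology[held_out]
        # Train on every sample whose histology differs from the held-out one.
        train_idx = sorted(
            i for hist, idxs in by_histology.items() if hist != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical histology labels, one per sample, for illustration only.
histologies = ["LGG", "LGG", "Medulloblastoma", "Ependymoma", "Medulloblastoma"]
folds = list(hold_one_histology_out(histologies))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Because no histology appears in both the training and test sets of a fold, a classifier that had only learned a histology-specific male/female imbalance would score poorly. scikit-learn's `LeaveOneGroupOut` provides the same kind of split if that library is already in use.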
@cgreene and @jaclyn-taroni - for the DNA sex prediction analysis we had performed, we used germline BAM files, which look like they are not included in the open CAVATICA projects, so for that specific analysis, do you think we should add the code to D3b github and add to the methods section and we can leave this open for an RNA analysis only?
I don't fully grasp the goals of this analysis yet. Is the idea to check the concordance with the sex column in pbta_histologies.csv? If so, then I would consider whether or not it's part of the clinical data harmonization task:
https://alexslemonade.github.io/OpenPBTA-manuscript/#clinical-data-harmonization
If so, I think you could convert sex to recorded_sex and add a germline_sex_estimate column to that file. Opening an issue for building a classifier for sex from the RNA expression data could also be helpful. It would be a good QC of our RNA-expression data, and also it should be quite accurate - probably enough that reporting concordance with both of the above could be interesting enough for a brief mention in the paper. Imagine folks who have a dataset with just RNA-expression data but who want to look at concordance with metadata: does RNA-based prediction agree at a similar level as germline-based? Finally, it also seems like a good_first_issue since all of the required bits except these germline-based predictions are already present, and it sounds like those will make it into release V3.
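As a rough illustration of the concordance check discussed here, the sketch below compares two columns and lists the discordant samples. The column names `recorded_sex` and `germline_sex_estimate` follow the renaming proposed in this comment. The `sample_id` key, the example rows, and the `concordance` helper are hypothetical and do not reflect the actual layout of pbta_histologies.csv.

```python
def concordance(records, col_a="recorded_sex", col_b="germline_sex_estimate"):
    """Return (fraction concordant, discordant sample ids), skipping rows with a missing value."""
    compared = [r for r in records if r.get(col_a) and r.get(col_b)]
    if not compared:
        return None, []
    discordant = [
        r["sample_id"] for r in compared if r[col_a].lower() != r[col_b].lower()
    ]
    return 1 - len(discordant) / len(compared), discordant


# Hypothetical rows, for illustration only.
records = [
    {"sample_id": "S1", "recorded_sex": "Male", "germline_sex_estimate": "Male"},
    {"sample_id": "S2", "recorded_sex": "Female", "germline_sex_estimate": "Male"},
    {"sample_id": "S3", "recorded_sex": "Female", "germline_sex_estimate": "Female"},
    {"sample_id": "S4", "recorded_sex": "", "germline_sex_estimate": "Female"},
]
rate, discordant = concordance(records)
print(rate, discordant)
```

The same function could be pointed at an RNA-based prediction column to see whether it agrees with the recorded value at a similar level as the germline-based estimate.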
The back story is that we determined sex to ensure our CNV benchmarking calls use the correct panel of normals (PON; a male panel for male patients only, etc.). Currently, CBTTC samples were run as paired T/N for CNVkit calls and PNOC samples were run using a mixed PON. The reason I really want to get those CNV calls on the sex chromosomes right is ATRX deletions (chr X).
We found 11 samples non-concordant with reported gender and one which we deemed contaminated (this one did not make it into PBTA due to a T/N mismatch QC, but was in the run). We did this on RNA as well, but this method is clearly not the way to go for RNA.
2019-08-21-bfx scientific meeting-shrivats.pptx
I like the idea of adding that to clinical data harmonization methods - we will do that, and I can change the headers of that V3 clinical file as you suggest. Thanks!
The RNA-seq analysis there looks like you're just looking at the fraction mapping to Y. I think that's different than the analysis suggested in https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/73#issuecomment-525274972, which should be more sensitive and specific. I still think the analysis proposed above could be helpful to compare. Some folks may have RNA-seq only cohorts, so knowing how a classifier built there would compare with germline could still be useful.
Is it also interesting that you're getting those mappings to the Y chromosome for females? I'm guessing these are mapping to the pseudoautosomal regions or potentially repeats. Would this be helpful for assessing alignment/calling approaches? I imagine this must already have been done by someone.
Agreed - we hadn't explored classifiers yet, but think we should! Yes - I think GATK has an input parameter dealing with sex that we currently don't use, but could improve calls on sex chromosomes.
I'd propose as a path forward:
Does this sound good? I have lots of meetings today, but I'll try to get the new issue filed at some point if you agree with this strategy.
Agreed!
Ok - I filed #84! I think the next step is to close this when the changes to histology land.
Scientific goals
What are the scientific goals of the analysis? Accurately predict sex of PBTA samples for downstream analyses
Proposed methods
What methods do you plan to use to accomplish the scientific goals? Ratio of Y:(X+Y) chromosomes; XIST expression in RNA
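A minimal sketch of the Y:(X+Y) ratio named above, assuming per-chromosome read counts are already available from the normal DNA BAMs. The helper names, the read counts, and especially the `threshold` value are hypothetical. The thread does not state a cutoff, and any real cutoff would need to be calibrated, since some reads from female samples also map to Y.

```python
def y_fraction(x_reads, y_reads):
    """Fraction of sex-chromosome reads that map to Y: Y / (X + Y)."""
    total = x_reads + y_reads
    if total == 0:
        raise ValueError("no reads mapped to chrX or chrY")
    return y_reads / total


def call_sex(x_reads, y_reads, threshold=0.05):
    """Call 'Male' when the Y fraction reaches the (hypothetical) threshold."""
    return "Male" if y_fraction(x_reads, y_reads) >= threshold else "Female"


# Hypothetical read counts, for illustration only.
print(y_fraction(900, 100), call_sex(900, 100))
print(y_fraction(995, 5), call_sex(995, 5))
```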
Required input data
What input data will you use for this analysis? Normal DNA BAMs; RNA-Seq FPKM
Proposed timeline
What is the timeline for the analysis? One week
Relevant literature
If there is relevant scientific literature, put links to those items here.