Proposed Analysis: RNA Expression-based Prediction of Sex

cgreene commented 5 years ago

Scientific goals

In https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/73 it's noted that the reported sex of the participants in the study didn't align with information from germline sequencing in 11 cases. It could be interesting to understand how accurately this can be called from gene expression data. For some studies, gene expression data is all that is available. If it can also be accurately called directly from gene expression data, these studies will be better positioned to evaluate their metadata.

Proposed methods

Construct an elastic-net logistic regression classifier using gene expression values as features and reported sex as the labels. I suggest elastic net because the signal is expected to be relatively linear, most genes are expected to play little role, but a few sets of genes are likely to be predictive and highly correlated, and it makes sense to spread weights across them to produce a robust predictor.
Evaluate using cross-validation with reported labels across the full set.
Evaluate using cross-validation with germline-based labels across the full set
Evaluation prediction accuracy using germline sequencing within each histology.

Required input data

For classifier construction:

The reported sex in the PBTA histologies file.
Gene expression estimates from kallisto, RSEM, or both.

For evaluation:

The germline-based sequencing calls.
Histologies, so that performance can be broken out by histology as well.

Proposed timeline

I am proposing this analysis but don't have time to do it, so I would leave this as an estimate for someone who decides to take this on.

Relevant literature

There are a number of reports of this being readily discoverable, even with unsupervised methods. These are two from our group where we noticed this, but there are also others from other groups:

Unsupervised VAE -> feature aligned with sex: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5728678/
Multiple unsupervised methods -> feature aligned with sex: https://www.biorxiv.org/content/10.1101/573782v1

bill-amadio commented 5 years ago

I would like to give this a try. pbta-gene-expression-kallisto.rds looks like a good first source of gene expression estimates (1028 samples of 200,000+ transcript abundance values). pbta-histologies.tsv has an entry for each of these samples. Model construction will be with glmnet and glmnetUtils packages which allow tuning, through cross validation, of the regularization weight (lambda) and elastic net weight (alpha) simultaneously.

jaclyn-taroni commented 5 years ago

Welcome @bill-amadio - thanks for getting this analysis started! We can use this issue to discuss strategy and address any questions around the analysis you have.

cgreene commented 5 years ago

@bill-amadio - very exciting! This plan sounds good - in the interests of compute time you may find that filtering to the transcripts that vary the most in an unsupervised way (something like the median absolute deviation) may make things much faster to run with little to no loss in accuracy. I'm looking forward to hearing how this goes!

bill-amadio commented 5 years ago

Thanks @jaclyn-taroni and @cgreene - Quick question: are the calls from germline sequencing 100% accurate? If not, is there some estimate of the uncertainty in the process?

cgreene commented 5 years ago

I don't know that we could say they are 100% accurate. We don't have that level of information, though they should be more closely aligned than the reported gender. The way I'm guessing that is by looking at the powerpoint linked in this comment from @jharenza https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/73#issuecomment-525727865. In slide 4 it looks like there are relatively few samples in the regions between the two distributions, and these are expected to be due to sample contamination.

bill-amadio commented 5 years ago

UnderstandKallistoRDS.pdf

Hi, Jackie and Casey,

I've attached a first result, a predictive model for reported_gender using the top 25% mean absolute deviation transcripts out of the kallisto V3 file. I split the data set 70/30 train/test, accuracy on the holdout was 96.7%. The elastic net alpha on this partition was 0.34. Other partitions have yielded alpha = 1. With alpha = 0.34, we have about 200 non-zero coefficients.

This doc still has a lot of my development diagnostics, but I wanted to get something to you as soon as possible. I used paths specific to my computer, and more wrangling code needs to be written. I had only one patient with more than one record in the histology file, so I dealt with the dupe manually. I think the final version should handle multiple dupes automatically.

I should be able to try the same code using the germline_sex-estimates next week.

Hope you both are well.

jharenza commented 5 years ago

I don't know that we could say they are 100% accurate. We don't have that level of information, though they should be more closely aligned than the reported gender. The way I'm guessing that is by looking at the powerpoint linked in this comment from @jharenza #73 (comment). In slide 4 it looks like there are relatively few samples in the regions between the two distributions, and these are expected to be due to sample contamination.

That's right, @bill-amadio - there was only one sample which landed between the two distributions and we had removed it from our dataset due to a previous QC using NGSCheckmate on its tumor/normal WGS data. We are presuming that germline sample was contaminated, hence why the fail on NGScheckmate and the mid-value for the sex prediction.

Great to have you working on the RNA side of this!

jharenza commented 5 years ago

UnderstandKallistoRDS.pdf

Hi, Jackie and Casey,

I've attached a first result, a predictive model for reported_gender using the top 25% mean absolute deviation transcripts out of the kallisto V3 file. I split the data set 70/30 train/test, accuracy on the holdout was 96.7%. The elastic net alpha on this partition was 0.34. Other partitions have yielded alpha = 1. With alpha = 0.34, we have about 200 non-zero coefficients.

This doc still has a lot of my development diagnostics, but I wanted to get something to you as soon as possible. I used paths specific to my computer, and more wrangling code needs to be written. I had only one patient with more than one record in the histology file, so I dealt with the dupe manually. I think the final version should handle multiple dupes automatically.

I should be able to try the same code using the germline_sex-estimates next week.

Hope you both are well.

Thanks for working on this! With regard to the duplicate sample, I noticed this and created an issue and this should be fixed in V4 of the data release, which we hope to have online by the end of next week.

It would be great if you wanted to start a pull request with your analysis code, link to this issue, and that way we can make direct comments within the PR :).

bill-amadio commented 5 years ago

OK, @jharenza. Will do.

bill-amadio commented 5 years ago

Apologies. This is my first multi-participant Git/GitHub project. I am not sure I have the symlinks and the pull request correct. I've reached out to a colleague here at Rider for some hand-holding. Will issue the pull request asap.

jaclyn-taroni commented 5 years ago

@bill-amadio No worries at all -- please let us know if there's anything we can do to help!

bill-amadio commented 5 years ago

@jaclyn-taroni thank you so much. I would prefer to come to you to setup and launch this first pull request. I can get to you Monday or Wednesday of next week. Is there a time on either of those days that works for you?

jaclyn-taroni commented 5 years ago

Hi @bill-amadio - I will send you an email to coordinate.

jaclyn-taroni commented 4 years ago

At the moment, this module fails in CI intermittently. See https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/391. This may indicate an underlying issue with the reproducibility of the analysis but I'm not sure if it's because of how it is run in CI or a more general issue. It warrants further investigation

sjspielman commented 4 years ago

Probably not the cause of CI failure, but I do have difficulty running the analysis locally. For example, the required library e1071 is never actually loaded for 03-evaluate_model.R Running w/in CI or Docker environment has likely been circumventing this bug, but the code needs to be explicit about what libraries are loaded and used. EDIT: This was identified running the script while getting angsty for the Docker env to spin up.

jaclyn-taroni commented 4 years ago

The e1071 problem likely stems from the way caret is installed (ref)

The package “suggests” field includes 30 packages. caret loads packages as needed and assumes that they are installed. If a modeling package is missing, there is a prompt to install it.

Install caret using
install.packages("caret", dependencies = c("Depends", "Suggests"))
to ensure that all the needed packages are installed.

and loaded, as you point out @sjspielman. I've run into this issue before.

bill-amadio commented 4 years ago

No chance the source of the error is in CI.

I had this error occur for the first time this week while looking at how predictive accuracy depends on the number of training transcripts. At low levels of training transcripts, all predictions are Male (the positive class for this dataset is Female), and the call to caret::twoClassSummary() fails. Calls to caret::confusionMatrix() work fine.

I've put all the twoClassSummary calls and saves of the resulting files inside try() functions. If a particular call fails, then there is no twoClassSummary file for that level of training transcripts in the results folder. This does not cause a problem further downstream with the results notebook.

The twoClassSummary reports only three values: ROC, sensitivity and specificity. Sensitivity and specificity are also in the Confusion Matrix output. Can we consider dropping the twoClassSummary from the 03-evaluate_model script? Or substituting a manual calculation and presentation if a test before calling twoClassSummary reveals zero predictions for one of the classes?

On Thu, Jan 2, 2020 at 11:35 AM Jaclyn Taroni notifications@github.com wrote:

The e1071 problem likely stems from the way caret is installed (ref https://cran.r-project.org/web/packages/caret/vignettes/caret.html)

The package “suggests” field includes 30 packages. caret loads packages as needed and assumes that they are installed. If a modeling package is missing, there is a prompt to install it.

Install caret using

install.packages("caret", dependencies = c("Depends", "Suggests"))

to ensure that all the needed packages are installed.

and loaded, as you point out @sjspielman https://github.com/sjspielman. I've run into this issue before.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/84?email_source=notifications&email_token=ACYCOYYW2SRG7AHN6SYVFUTQ3YJWHA5CNFSM4IRHVV2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEH6YHRA#issuecomment-570262468, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACYCOYYBCSWC4EOVVQNYKTTQ3YJWHANCNFSM4IRHVV2A .

bill-amadio commented 4 years ago

I got the e1071 message from caret on my local machine. I may have been doing something that did not make it into the current release. Current scripts contain library(caret), but not library(e1071).

On Thu, Jan 2, 2020 at 11:35 AM Jaclyn Taroni notifications@github.com wrote:

The e1071 problem likely stems from the way caret is installed (ref https://cran.r-project.org/web/packages/caret/vignettes/caret.html)

The package “suggests” field includes 30 packages. caret loads packages as needed and assumes that they are installed. If a modeling package is missing, there is a prompt to install it.

Install caret using

install.packages("caret", dependencies = c("Depends", "Suggests"))

to ensure that all the needed packages are installed.

and loaded, as you point out @sjspielman https://github.com/sjspielman. I've run into this issue before.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/84?email_source=notifications&email_token=ACYCOYYW2SRG7AHN6SYVFUTQ3YJWHA5CNFSM4IRHVV2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEH6YHRA#issuecomment-570262468, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACYCOYYBCSWC4EOVVQNYKTTQ3YJWHANCNFSM4IRHVV2A .

jaclyn-taroni commented 4 years ago

Now that this module has a README as of #404, I believe this issue can be closed and if we decide to make changes, we can file a new Update an analysis issue. Thanks for all your hard work @bill-amadio, congrats!

bill-amadio commented 4 years ago

@jaclyn-taroni It was my pleasure. Thank you for the opportunity and for all your help.

AlexsLemonade / OpenPBTA-analysis