I would like to give this a try. pbta-gene-expression-kallisto.rds looks like a good first source of gene expression estimates (1028 samples with 200,000+ transcript abundance values each). pbta-histologies.tsv has an entry for each of these samples. Model construction will use the glmnet and glmnetUtils packages, which allow the regularization weight (lambda) and the elastic net mixing weight (alpha) to be tuned simultaneously through cross-validation.
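For illustration, a minimal sketch of how glmnetUtils can tune alpha and lambda together (object names like `expr_mat` and `sex_labels` are hypothetical placeholders, not from any actual script):

```r
# Sketch only: cva.glmnet() runs cv.glmnet() over a grid of alpha values,
# cross-validating lambda within each, so both can be tuned at once.
library(glmnetUtils)

# expr_mat: samples x transcripts matrix of abundance values (placeholder)
# sex_labels: two-level factor of reported gender (placeholder)
cv_fit <- cva.glmnet(
  x = expr_mat,
  y = sex_labels,
  family = "binomial",
  alpha = seq(0, 1, by = 0.1)
)

# Compare cross-validation error across alpha values to pick a model
plot(cv_fit)
```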
Welcome @bill-amadio - thanks for getting this analysis started! We can use this issue to discuss strategy and address any questions around the analysis you have.
@bill-amadio - very exciting! This plan sounds good - in the interests of compute time you may find that filtering to the transcripts that vary the most in an unsupervised way (something like the median absolute deviation) may make things much faster to run with little to no loss in accuracy. I'm looking forward to hearing how this goes!
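As an example, an unsupervised filter along those lines might look like this sketch (`expr_mat` is a hypothetical transcripts-by-samples matrix, and the 25% cutoff is just illustrative):

```r
# Keep the most variable transcripts by median absolute deviation (MAD),
# computed without using any labels.
transcript_mad <- apply(expr_mat, 1, stats::mad)
keep <- transcript_mad >= quantile(transcript_mad, probs = 0.75)
filtered_expr <- expr_mat[keep, , drop = FALSE]
```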
Thanks @jaclyn-taroni and @cgreene - Quick question: are the calls from germline sequencing 100% accurate? If not, is there some estimate of the uncertainty in the process?
I don't know that we could say they are 100% accurate. We don't have that level of information, though they should be more closely aligned than the reported gender. My guess is based on the PowerPoint linked in this comment from @jharenza: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/73#issuecomment-525727865. On slide 4 it looks like there are relatively few samples in the region between the two distributions, and these are expected to be due to sample contamination.
Hi, Jackie and Casey,
I've attached a first result: a predictive model for reported_gender using the top 25% of transcripts by mean absolute deviation from the kallisto V3 file. I split the data set 70/30 train/test; accuracy on the holdout set was 96.7%. The elastic net alpha on this partition was 0.34; other partitions have yielded alpha = 1. With alpha = 0.34, there are about 200 non-zero coefficients.
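Roughly, the holdout evaluation and coefficient count described above could look like this sketch (object names and the lambda value are placeholders; this is not the exact code in the attached doc):

```r
library(glmnet)

# Fit the elastic net at the alpha selected for this partition (illustrative value)
fit <- glmnet(x = train_mat, y = train_labels, family = "binomial", alpha = 0.34)

# Count non-zero coefficients at a chosen lambda (placeholder value)
chosen_lambda <- 0.01
n_nonzero <- sum(as.matrix(coef(fit, s = chosen_lambda)) != 0)

# Accuracy on the 30% holdout set
pred_class <- predict(fit, newx = test_mat, s = chosen_lambda, type = "class")
holdout_accuracy <- mean(pred_class == as.character(test_labels))
```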
This doc still has a lot of my development diagnostics, but I wanted to get something to you as soon as possible. I used paths specific to my computer, and more wrangling code needs to be written. I had only one patient with more than one record in the histology file, so I dealt with the dupe manually. I think the final version should handle multiple dupes automatically.
I should be able to try the same code using the germline sex estimates next week.
Hope you both are well.
That's right, @bill-amadio - there was only one sample that landed between the two distributions, and we had removed it from our dataset due to a previous QC failure with NGSCheckmate on its tumor/normal WGS data. We are presuming that germline sample was contaminated, hence the NGSCheckmate failure and the mid-range value for the sex prediction.
Great to have you working on the RNA side of this!
Thanks for working on this! With regard to the duplicate sample, I noticed this and created an issue; it should be fixed in V4 of the data release, which we hope to have online by the end of next week.
It would be great if you wanted to start a pull request with your analysis code, link to this issue, and that way we can make direct comments within the PR :).
OK, @jharenza. Will do.
Apologies. This is my first multi-participant Git/GitHub project. I am not sure I have the symlinks and the pull request correct. I've reached out to a colleague here at Rider for some hand-holding. Will issue the pull request asap.
@bill-amadio No worries at all -- please let us know if there's anything we can do to help!
@jaclyn-taroni thank you so much. I would prefer to come to you to setup and launch this first pull request. I can get to you Monday or Wednesday of next week. Is there a time on either of those days that works for you?
Hi @bill-amadio - I will send you an email to coordinate.
At the moment, this module fails in CI intermittently. See https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/391. This may indicate an underlying issue with the reproducibility of the analysis, but I'm not sure whether it's because of how it is run in CI or a more general problem. It warrants further investigation.
Probably not the cause of the CI failure, but I do have difficulty running the analysis locally. For example, the required library e1071 is never actually loaded in 03-evaluate_model.R. Running within the CI or Docker environment has likely been circumventing this bug, but the code needs to be explicit about which libraries are loaded and used.
EDIT: This was identified while running the script and getting antsy waiting for the Docker env to spin up.
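A minimal sketch of the explicit loading being suggested (assuming the script relies on caret for model evaluation):

```r
# Load every package the script depends on up front, rather than relying on
# caret to pull in suggested packages on demand.
library(caret)   # confusionMatrix(), twoClassSummary()
library(e1071)   # the missing dependency noted above
```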
The e1071 problem likely stems from the way caret is installed (ref: https://cran.r-project.org/web/packages/caret/vignettes/caret.html):

"The package “suggests” field includes 30 packages. caret loads packages as needed and assumes that they are installed. If a modeling package is missing, there is a prompt to install it. Install caret using install.packages("caret", dependencies = c("Depends", "Suggests")) to ensure that all the needed packages are installed."

...and loaded, as you point out @sjspielman. I've run into this issue before.
There's no chance the source of the error is in CI.
I had this error occur for the first time this week while looking at how predictive accuracy depends on the number of training transcripts. With small numbers of training transcripts, all predictions are Male (the positive class for this dataset is Female), and the call to caret::twoClassSummary() fails. Calls to caret::confusionMatrix() work fine.
I've put all the twoClassSummary calls and saves of the resulting files inside try() functions. If a particular call fails, then there is no twoClassSummary file for that level of training transcripts in the results folder. This does not cause a problem further downstream with the results notebook.
The twoClassSummary reports only three values: ROC, sensitivity and specificity. Sensitivity and specificity are also in the Confusion Matrix output. Can we consider dropping the twoClassSummary from the 03-evaluate_model script? Or substituting a manual calculation and presentation if a test before calling twoClassSummary reveals zero predictions for one of the classes?
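One way to implement the guard suggested above, as a hedged sketch (`pred_df` is a hypothetical data frame with obs/pred factor columns plus the per-class probability columns that caret summary functions expect):

```r
library(caret)

# Only call twoClassSummary() when both classes appear among the predictions;
# otherwise fall back to a manual sensitivity/specificity calculation with
# Female as the positive class.
if (nlevels(droplevels(pred_df$pred)) == 2) {
  summary_stats <- twoClassSummary(pred_df, lev = levels(pred_df$obs))
} else {
  summary_stats <- c(
    Sens = sensitivity(pred_df$pred, pred_df$obs, positive = "Female"),
    Spec = specificity(pred_df$pred, pred_df$obs, negative = "Male")
  )
}
```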
I got the e1071 message from caret on my local machine. I may have been doing something that did not make it into the current release. Current scripts contain library(caret), but not library(e1071).
Now that this module has a README as of #404, I believe this issue can be closed, and if we decide to make changes we can file a new "Update an analysis" issue. Thanks for all your hard work @bill-amadio, congrats!
@jaclyn-taroni It was my pleasure. Thank you for the opportunity and for all your help.
Scientific goals
In https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/73 it's noted that the reported sex of the participants in the study didn't align with information from germline sequencing in 11 cases. It would be interesting to understand how accurately sex can be called from gene expression data. For some studies, gene expression data is all that is available; if sex can also be called accurately from gene expression data, those studies will be better positioned to evaluate their metadata.
Proposed methods
Required input data
For classifier construction:
For evaluation:
Proposed timeline
I am proposing this analysis but don't have time to do it, so I would leave this as an estimate for someone who decides to take this on.
Relevant literature
There are a number of reports of this being readily discoverable, even with unsupervised methods. These are two from our group where we noticed this, but there are also others from other groups: