Proposed Analysis: Comparative RNA-Seq analysis

GeoffLyle commented 4 years ago

Scientific goals

Create correlation matrices for polyA samples and ribodeplete (stranded) samples using gene expression data.
Generate gene outlier thresholds for ribodeplete samples.
List outlier genes for each ribodeplete sample.
Discover trends in outlier genes by tumor subtype.
Investigate possible targeted therapeutics based on outlier expression profile.

Proposed methods

The correlations will be calculated using pairwise Spearman correlation of the RNA-Seq gene expression profiles.
The gene expression profile data will be filtered using the method described in Vaske et al. Note: This process is the precursor to creating a TumorMap (https://tumormap.ucsc.edu/). While programmatic creation of a TumorMap is not currently open access, we can add instructions for users on how to manually generate a TumorMap on the website.
Outlier thresholds will be calculated according to the Tukey method.
Discovery of outlier genes and possible targeted therapeutics will be based on the Treehouse CARE method. https://github.com/UCSC-Treehouse/CARE
- The CARE method will be used to find outliers using the entire population as a cohort as well as subsamples from the population to disentangle sample specific outliers from tumor subtype outliers.
A novel transfer learning method trained on CCLE response data will be applied to Open-PBTA samples to predict response to therapeutics.
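
As an illustration of the first and third method bullets above, here is a minimal sketch of pairwise Spearman correlation and Tukey-style outlier thresholds using only the Python standard library. The function names, the toy inputs, the fence multiplier `k=1.5`, and the `inclusive` quartile method are assumptions made for this example; they are not taken from the CARE code base, which may define its thresholds differently.

```python
import statistics


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one profile is constant
    return cov / (var_x * var_y) ** 0.5


def correlation_matrix(profiles):
    """Pairwise Spearman correlations for {sample_id: [expression, ...]}."""
    return {a: {b: spearman(profiles[a], profiles[b]) for b in profiles}
            for a in profiles}


def tukey_thresholds(values, k=1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy expression profiles (made-up numbers, one list of gene values per sample).
toy_profiles = {"sample_1": [1, 2, 3, 4], "sample_2": [10, 20, 30, 40]}
toy_matrix = correlation_matrix(toy_profiles)
toy_fences = tukey_thresholds([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

In practice each sample's profile would be the filtered gene expression vector, and a gene would be flagged as an outlier in a sample when its value falls beyond the fence computed across the cohort.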

Required input data

- pbta-gene-expression-kallisto.polya.rds
- pbta-gene-expression-kallisto.stranded.rds

Is it possible to get RSEM TPM values for this data? That would make it easier for us to adapt existing algorithms to work with the Open-PBTA data.

Proposed timeline

8 weeks

Relevant literature

Method published in Vaske et al. JAMA Network Open. 2019. (https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2753519)

jharenza commented 4 years ago

Hi @GeoffLyle! Thanks for proposing this analysis! We currently have RSEM FPKM values available in our data release: pbta-gene-expression-rsem-fpkm.polya.rds and pbta-gene-expression-rsem-fpkm.stranded.rds. Since FPKM and TPM from RSEM are highly correlated, will this do, or does your analysis require TPM? If the latter, I can file an issue to add the TPM data during our next sprint, which starts Nov. 13, and the data could be expected sometime around the end of next week.
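
For context on how the two units relate: within a single sample, TPM is the FPKM values rescaled so that they sum to one million. A rough sketch of that rescaling is below. The function name and the toy values are made up for illustration, and it assumes both units come from the same quantification run.

```python
def fpkm_to_tpm(fpkm_values):
    """Rescale one sample's FPKM values so they sum to one million."""
    total = sum(fpkm_values)
    if total == 0:
        raise ValueError("cannot rescale a sample whose FPKM values sum to zero")
    return [value / total * 1e6 for value in fpkm_values]


# Toy FPKM values for three genes in one sample (made-up numbers).
toy_tpm = fpkm_to_tpm([1.0, 1.0, 2.0])
```

Because the scaling factor differs from sample to sample, gene rankings within a sample are unchanged, but values compared across samples can differ between the two units.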

GeoffLyle commented 4 years ago

@jharenza I believe our analysis could work with the FPKM or Kallisto TPM values. However, our current workflow is set up to use RSEM TPM values and our learning model was trained on TPM data. Since our team is much more comfortable working with RSEM TPM, it would be great if that data could be provided. The end of next week works perfectly well for our timeline. Thanks!

jharenza commented 4 years ago

OK, will add to our sprint, thanks!

GeoffLyle commented 4 years ago

Hi @jharenza, I talked with our bioinformatician and she informed me that we are interested in both transcript-level and gene-level RSEM TPM. Would it be possible to generate both of these files during your sprint?

jharenza commented 4 years ago

@GeoffLyle will do both!

jaclyn-taroni commented 4 years ago

Hi @GeoffLyle I'm looking forward to these analyses. Thank you for proposing them and thanks @jharenza for your team's help with the RSEM TPM.

I wanted to ask how you were planning to split up the analyses for submission. Would each bullet point under scientific goals be its own pull request, or will some steps be grouped together?

I also wanted to note that we are planning to flesh out the molecular subtype information (#19), which may be of scientific interest here. If so, the column molecular_subtype in the pbta-histologies.tsv that is included in the data download would be the one that gets filled in with more information and is therefore the one you may want to use during development.

GeoffLyle commented 4 years ago

@jaclyn-taroni Our submission of analyses will follow the scientific goals bullet points. Our first analysis will definitely be "Create correlation matrices for polyA samples and ribodeplete (stranded) samples using gene expression data." "Generate gene outlier thresholds for ribodeplete samples." and "List outlier genes for each ribodeplete sample." may be two separate analyses or combined into one pull request, depending on the speed and ease of engineering our algorithm to work with the data. "Discover trends in outlier genes by tumor subtype." and "Investigate possible targeted therapeutics based on outlier expression profile." are broad goals that we are still determining how to implement. Typically this analysis is done on an N-of-1 patient sample, so I will have to explore the data to determine what trends there are and how best to present them. We also have some modeling software that our graduate students have been working on, which we would like to adapt to the Open-PBTA data and submit as an analysis.

In short, the first 3 scientific goals will be submitted as 2 or 3 pull requests. The last two scientific goals may be submitted in 2+ pull requests.

Thanks for the heads up on the pbta-histologies.tsv update. We planned to look into differences by disease and histology. It would be interesting to see if there is a signal when grouping by molecular subtype.

jaclyn-taroni commented 4 years ago

Sounds good @GeoffLyle, thank you! If you have any questions that pop up as you're implementing the correlation matrices step, please let us know.

jharenza commented 4 years ago

Hi @GeoffLyle - hopefully you have been following, but the TPM data you requested became available with release V10 (#273). Excited to see your PRs come through!

GeoffLyle commented 4 years ago

@jharenza Thank you! Saw the V10 release. I'll let the team know about the V11 release. Should be a good test to ensure everything still runs correctly.

jharenza commented 4 years ago

@GeoffLyle ahh spoke too soon and corrected it. @jaclyn-taroni is running V11 tests currently, then we will merge once everything works properly :).

jaclyn-taroni commented 4 years ago

Hi @GeoffLyle and @e-t-k, I wanted to check in. Do you have an idea of when you expect to file a pull request for the outlier thresholds and outlier gene lists? Also, if you are working with the release-v12-20191217 data, I wanted to make you aware that there are stranded poly-A samples in the stranded dataset files, which are mostly comprised of ribodepleted samples (see #374), though this may well have been apparent from your results. Those samples will come out in the v13 release, which is planned for next week (#373).

e-t-k commented 4 years ago

Hi @jaclyn-taroni, I've been working on some other priorities recently, so the current estimate for this is early Feb, but I could look into making it sooner if that's behind why you're asking.

And thank you for the heads up regarding the stranded PolyA samples; your exclusion of them in the next release works just fine for us. @GeoffLyle also noticed there are some cell line samples in the data set that we'll want to exclude as well, which we were expecting to just implement as a filtering step.

jaclyn-taroni commented 4 years ago

Hi @e-t-k,

Thanks for your reply!

> I've been working on some other priorities recently, so current estimate for this is early Feb, but I could look into making it sooner if that's behind why you're asking.

We are looking to wind down the analysis phase of the project and prepare the first draft of the OpenPBTA manuscript relatively soon.

If you would like this analysis to be included in the first version of the manuscript, I would recommend filing a pull request as soon as possible. There is also the option of including this analysis during revisions. (For the Deep Review paper that @cgreene previously organized, they continued to add things during revisions beyond what reviewers asked for.)

There also may be more lag time between when a pull request is initially filed and when it is reviewed as we move our focus to writing the manuscript.

Whatever you decide works for us, but I wanted to be transparent about what the next few months of the project will look like! We'd love to include this in the first version if the timing works out.

> And thank you for the heads up regarding the stranded PolyA samples; your exclusion of them in the next release works just fine for us.

release-v13-20200116 went out yesterday (#444), so if you're up to date with the AlexsLemonade master branch you'll be able to snag that by rerunning the download bash script.

> @GeoffLyle also noticed there are some cell line samples in the data set that we'll want to exclude as well, which we were expecting to just implement as a filtering step.

Ah yes, sounds good. If you're looking to limit your analysis to tumor samples, the pattern that we've used throughout the project is to filter the pbta-histologies.tsv file to rows where sample_type is Tumor and composition is Solid Tissue. (Here's an example in R: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/c96bac0807577f30de6be13f879f2360096b03c9/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd#L47.) If you further filter such that all rows are RNA-Seq in experimental_strategy, the Kids_First_Biospecimen_ID will contain all the assay identifiers you're looking for. You may have gotten to the bottom of this already, but the presence of cell lines and different identifiers tripped me up when I first started working with the OpenPBTA data :relaxed:
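
The filtering pattern described above can be sketched in Python as follows. The column names and the values "Tumor", "Solid Tissue", and "RNA-Seq" come from the description above; the function name, the toy rows, and the toy identifiers and non-matching values are invented for this example and do not reflect real entries in pbta-histologies.tsv.

```python
import csv
import io


def tumor_rnaseq_ids(tsv_file):
    """Return biospecimen IDs for solid-tissue tumor RNA-Seq rows."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [
        row["Kids_First_Biospecimen_ID"]
        for row in reader
        if row["sample_type"] == "Tumor"
        and row["composition"] == "Solid Tissue"
        and row["experimental_strategy"] == "RNA-Seq"
    ]


# Made-up rows for illustration only; real IDs and values come from the file.
TOY_ROWS = (
    "Kids_First_Biospecimen_ID\tsample_type\tcomposition\texperimental_strategy\n"
    "BS_TOY1\tTumor\tSolid Tissue\tRNA-Seq\n"
    "BS_TOY2\tTumor\tDerived Cell Line\tRNA-Seq\n"
    "BS_TOY3\tTumor\tSolid Tissue\tWGS\n"
    "BS_TOY4\tNormal\tSolid Tissue\tWGS\n"
)
toy_ids = tumor_rnaseq_ids(io.StringIO(TOY_ROWS))
```

With the real file, the same function could be called on an open file handle, for example `open("pbta-histologies.tsv", newline="")`, assuming the file sits in the working directory.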

GeoffLyle commented 3 years ago

@jaclyn-taroni We want to add our cohort finding script to this analysis. As it uses the correlation matrix that is already created for finding the gene outlier status of samples, I believe it would fit into this analysis rather than being a separate proposed analysis.

The idea is to calculate the Spearman distance between a focus sample and all other samples. If the Spearman distance is above a certain threshold (typically above the 95th percentile of all Spearman distances), those samples are added to a cohort called "First-Degree Neighbors". A second cohort called "First and Second-Degree Neighbors" contains the first-degree neighbors as well as those samples' own first-degree neighbors. A third cohort is "Diseases of top 6 samples", which contains all samples whose disease matches that of any of the 6 samples with the highest Spearman score.

This will involve adapting a currently used script as well as harmonizing the diseases found in the "disease_type_new" column from pbta-histologies.tsv. We believe this level of analysis will not cause any issues with the CI.
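
A rough sketch of the three cohorts, written against a precomputed correlation matrix stored as a nested dictionary. All function names, the toy scores, and the toy disease labels are hypothetical. The sketch assumes that a higher Spearman score means more similar samples, that the threshold is passed in rather than derived from the 95th percentile, and that the focus sample is excluded from its own cohorts; the existing script may differ on each point.

```python
def first_degree_neighbors(corr, focus, threshold):
    """Samples whose Spearman score with the focus sample exceeds the threshold."""
    return {s for s, score in corr[focus].items()
            if s != focus and score > threshold}


def first_and_second_degree_neighbors(corr, focus, threshold):
    """First-degree neighbors plus each of their own first-degree neighbors."""
    first = first_degree_neighbors(corr, focus, threshold)
    cohort = set(first)
    for neighbor in first:
        cohort |= first_degree_neighbors(corr, neighbor, threshold)
    cohort.discard(focus)  # the focus sample is not part of its own cohort
    return cohort


def disease_cohort(corr, focus, diseases, top_n=6):
    """All samples sharing a disease with any of the top_n most similar samples."""
    others = sorted((s for s in corr[focus] if s != focus),
                    key=lambda s: corr[focus][s], reverse=True)
    top_diseases = {diseases[s] for s in others[:top_n]}
    return {s for s, d in diseases.items() if s != focus and d in top_diseases}


# Toy symmetric correlation matrix (made-up scores for four samples).
toy_pairs = {("A", "B"): 0.9, ("A", "C"): 0.5, ("A", "D"): 0.2,
             ("B", "C"): 0.85, ("B", "D"): 0.3, ("C", "D"): 0.1}
toy_corr = {s: {s: 1.0} for s in "ABCD"}
for (a, b), score in toy_pairs.items():
    toy_corr[a][b] = score
    toy_corr[b][a] = score

# Toy disease labels, invented for the example.
toy_diseases = {"A": "glioma", "B": "ependymoma",
                "C": "ependymoma", "D": "medulloblastoma"}

first = first_degree_neighbors(toy_corr, "A", threshold=0.8)
first_and_second = first_and_second_degree_neighbors(toy_corr, "A", threshold=0.8)
by_disease = disease_cohort(toy_corr, "A", toy_diseases, top_n=1)
```

Here sample C is not close enough to A to be a first-degree neighbor, but it joins the second cohort through B, and it joins the disease cohort because it shares B's label.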

jaclyn-taroni commented 3 years ago

Hi @GeoffLyle, adding what you've described to analyses/comparative-RNASeq-analysis/ makes sense to me!

> We believe this level of analysis will not cause any issues with the CI.

To confirm, you believe that it will run with 8GB RAM and in a reasonable amount of time in CI (10 minutes or less without some kind of output), is that correct?

I wanted to point you to how to obtain the files we use for CI in case you'd like to do some further testing: https://github.com/AlexsLemonade/OpenPBTA-analysis#working-with-the-subset-files-used-in-ci-locally
