Updated analysis: RNA expression of copy number losses

jaclyn-taroni commented 4 years ago

What analysis module should be updated and why?

focal-cn-file-preparation, specifically 02-rna-expression-validation.R.

Why should the module be updated?

As noted in https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/367#issue-356029454, this step gets OOM-killed for me locally and we now include collapsed RNA-seq matrices in the data download that this script could use.

What changes need to be made? Please provide enough detail for another participant to make the update.

At the moment, it is not clear to me which step(s) require a lot of RAM or take a long time to run. This file is the first place I would look: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/focal-cn-file-preparation/util/rna-expression-functions.R
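One way to narrow down the expensive step is to time each stage and record its peak memory in isolation. The sketch below shows the idea in Python using only the standard library. The module itself is written in R, so this is not the module's code: `profile_step`, `read_expression`, and `to_long_format` are hypothetical stand-ins for whatever the real steps turn out to be.

```python
import time
import tracemalloc


def profile_step(label, func, *args, **kwargs):
    """Run func once; return its result plus wall time and peak traced memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    stats = {"step": label, "seconds": elapsed, "peak_mib": peak_bytes / 2**20}
    return result, stats


# Hypothetical stand-in: build a genes x samples matrix of expression values.
def read_expression(n_genes, n_samples):
    return [[float(g + s) for s in range(n_samples)] for g in range(n_genes)]


# Hypothetical stand-in: reshape the wide matrix into long (gene, sample, value) rows.
def to_long_format(matrix):
    return [(g, s, v) for g, row in enumerate(matrix) for s, v in enumerate(row)]


matrix, read_stats = profile_step("read_expression", read_expression, 200, 50)
long_rows, long_stats = profile_step("to_long_format", to_long_format, matrix)

for stats in (read_stats, long_stats):
    print(f"{stats['step']}: {stats['seconds']:.4f} s, peak {stats['peak_mib']:.2f} MiB")
```

Comparing the per-step numbers should point to the stage worth rewriting first. In R, the built-in `Rprof()` profiler serves a similar purpose.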

What input data should be used? Which data were used in the version being updated?

Previously, RSEM FPKM files were used. I propose that we use pbta-gene-expression-rsem-fpkm-collapsed.polya.rds and pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds.

We also are currently using the ControlFreeC file produced by that module (analyses/focal-cn-file-preparation/results/controlfreec_annotated_cn_autosomes.tsv.gz), but we will eventually want to move to using some kind of consensus file (#128).

When do you expect the revised analysis will be completed?

This may take 1-3 days.

Who will complete the updated analysis?

Not sure.

jaclyn-taroni commented 4 years ago

The focal-cn-file-preparation module is being revamped (and sped up!) over on #452. It's worth noting that those revisions may have addressed the major bottleneck for that analysis module. I think profiling and improving the code that examines RNA-seq expression levels of losses is a good step to follow #452. As such, I am going to mark this in progress and assign @cbethell. As noted above, we also want to look at the expression levels for copy number losses that are in the consensus calls.

cbethell commented 4 years ago

In addition to the plots in PR #493, density plots were generated using the consensus SEG autosomes files, annotated in the focal-cn-file-preparation module. These plots use z-scored expression values (polyA and stranded expression files are handled separately) to look at the density of copy number calls, specifically looking to validate the losses.
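As a rough illustration of the comparison described above, the sketch below z-scores expression and groups the values by copy number status, which is the input a density plot would need. It is a Python sketch, not the notebook's code. It assumes z-scores are computed per gene across samples within a single library type, and the function names, the toy data, and the `"loss"`/`"neutral"` labels are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore_by_gene(expression):
    """expression: {gene: {sample: value}} -> z-scores per gene across samples."""
    zscores = {}
    for gene, by_sample in expression.items():
        values = list(by_sample.values())
        sd = stdev(values) if len(values) > 1 else 0.0
        if sd == 0:
            continue  # no variation across samples, so a z-score is undefined
        mu = mean(values)
        zscores[gene] = {s: (v - mu) / sd for s, v in by_sample.items()}
    return zscores


def group_by_status(zscores, cn_calls):
    """cn_calls: iterable of (sample, gene, status) -> {status: [z-scores]}."""
    grouped = defaultdict(list)
    for sample, gene, status in cn_calls:
        value = zscores.get(gene, {}).get(sample)
        if value is not None:
            grouped[status].append(value)
    return dict(grouped)


# Toy data for ONE library type; polyA and stranded would each be run separately.
expression_polya = {
    "MET": {"S1": 1.0, "S2": 2.0, "S3": 3.0},
    "FLAT": {"S1": 5.0, "S2": 5.0, "S3": 5.0},
}
cn_calls = [
    ("S1", "MET", "loss"),
    ("S2", "MET", "neutral"),
    ("S3", "MET", "neutral"),
    ("S1", "FLAT", "loss"),
]

zscores = zscore_by_gene(expression_polya)
grouped = group_by_status(zscores, cn_calls)
print(grouped)
```

If losses are reflected in expression, the `"loss"` values should sit to the left of the `"neutral"` values. Running each library type through the function separately keeps the z-scores from being distorted by differences between polyA and stranded preparation.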

The rendered notebook can be seen here.

The first two plots in the notebook look at calls across all genes in the annotated consensus SEG files. In these plots, there does not appear to be much differentiation between neutral and loss calls. Below them are faceted plots focusing on each driver gene. In some instances, these plots agree with the plots above, and in others the plots look slightly more like what we would expect (e.g., MET).
