UMAPs without mitochondria

sjspielman commented 1 year ago

This PR explores, as part of manuscript revisions, how dimension reduction might change if we do not include mitochondrial genes, as inspired by this issue: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1601 Specifically, it seems that one sequencing center has some problematic mitochondrial gene expression distributions that may have resulted from using a different kit whose identity may also be lost to the ages.

There are two main changes in this module:

I updated the transcriptomic-dimension-reduction module to include an option to remove mitochondrial genes. I added this run for stranded RSEM with log2 (skipping tnse) in order to generate data for this exploration. But, I did not add this run to the 02 plot script , since we don't really need those.
I added a notebook to transcriptomic-dimension-reduction to explore the mitochondrial expression, first via just looking at FPKMs of mito genes for relevant diagnoses and then via UMAPs made with and without mitochondrial genes. I don't see anything too different here, but welcome thoughts on interpretation. Here's that notebook: 05-seq-center-mitochondrial-genes.nb.html.zip

Note that I had some merge conflicts bringing master in, so I had to open and clean up some the tumor purity scripts (including re-rendering a notebook), and along the way there VS Code ate lots of EOL spaces, so that's why those diffs are here!

sjspielman commented 1 year ago

@jashapiro this now filters as expected! For the record, here's how we bounce around scripts in that module:

User calls dimension-reduction-plots.sh (or CI version) to run the module
That shell script runs 01-dimension-reduction.sh, which itself calls scripts/run-dimension-reduction.R with various options
- One of those options now properly has the remove_mito_genes flag
scripts/run-dimension-reduction.R has a remove_mito_genes option and an associated if-statement for filtering out ^MT genes

I regenerated the results, and they are about the same either way. I added a small "Conclusions" section to the notebook to be more clear about that.

jashapiro commented 1 year ago

I was about to approve this and say that this looks good to me, but then I realized that this may not actually be sufficient for "removing" the mitochondrial genes. The FPKM calculation will still be using the mitochondrial genes in the denominator, which will affect the expression values for all of the other genes before we get to this point. So we may want to remove the mitochondrial genes from the count files, then recalculate FPKM.

I think we can do this using the count matrices that we have, though I believe the original the FPKM calculation occurs upstream, so we might not get precisely the same results if we reimplement the FPKM calculations. (So we should recalculate with and without MT) Before attempting this, we should probably discuss whether it is the right thing to spend time on at this moment.

My main followup thought here is that we can probably do a rough version of this using the existing FPKM or TPM matrices: With TPM, I think we can re-sum the TPM values for each sample, excluding mitochondrial genes, divide by the sum/10^6 to rescale. That won't be quite right for FPKM, but it will probably be close enough. Still, I'd probably start with TPM for this notebook (calculating UMAP from the original TPM and then from the mito-excluded & rescaled TPM) as we are just looking to see if there is an effect of removing the mitochondrial genes. If we feel we need to go back to FPKM calculations later for better consistency with previous analyses, we can still do that.

sjspielman commented 1 year ago

@jashapiro We have some TPM UMAPs 😄 Here's the re-rendered notebook exploring the results. The plots don't look wildly different from FPKM which is a nice little sanity check to see. Overall trends look roughly the same with(out) mitochondrial genes. 05-seq-center-mitochondrial-genes.nb.html.zip

For this review, please let* (edit) me know in particular any feedback on how I did the TPM prep (scripts being called from right places, and scratch/ export choice). I also kept the COUNT_THRESHOLD bash variable for filtering out low counts at 100, which seems more or less fine in the end.

sjspielman commented 1 year ago

@jashapiro As of now, CI is passing through this module, so it's ready for another look! I updated the module to wholesale remove the remove_mito_genes flag since it's not needed at all anymore. I also re-ran the module to properly generate the HTMLs with the full data (not the CI data as was there before, woops!). Here’s the notebook: 05-seq-center-mitochondrial-genes.nb.html.zip

Since the tumor purity had a diff (see my last bit here https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1658#issue-1558238784), I also decided just to be safe to re-run that module as well. All the notebook diffs are HTML miscellany.

AlexsLemonade / OpenPBTA-analysis

UMAPs without mitochondria #1658