Closed sjspielman closed 1 year ago
@jashapiro this now filters as expected! For the record, here's how we bounce around scripts in that module:
dimension-reduction-plots.sh
(or CI version) to run the module01-dimension-reduction.sh
, which itself calls scripts/run-dimension-reduction.R
with various options
remove_mito_genes
flagscripts/run-dimension-reduction.R
has a remove_mito_genes
option and an associated if-statement for filtering out ^MT
genesI regenerated the results, and they are about the same either way. I added a small "Conclusions" section to the notebook to be more clear about that.
I was about to approve this and say that this looks good to me, but then I realized that this may not actually be sufficient for "removing" the mitochondrial genes. The FPKM calculation will still be using the mitochondrial genes in the denominator, which will affect the expression values for all of the other genes before we get to this point. So we may want to remove the mitochondrial genes from the count files, then recalculate FPKM.
I think we can do this using the count matrices that we have, though I believe the original the FPKM calculation occurs upstream, so we might not get precisely the same results if we reimplement the FPKM calculations. (So we should recalculate with and without MT) Before attempting this, we should probably discuss whether it is the right thing to spend time on at this moment.
My main followup thought here is that we can probably do a rough version of this using the existing FPKM or TPM matrices: With TPM, I think we can re-sum the TPM values for each sample, excluding mitochondrial genes, divide by the sum/10^6 to rescale. That won't be quite right for FPKM, but it will probably be close enough. Still, I'd probably start with TPM for this notebook (calculating UMAP from the original TPM and then from the mito-excluded & rescaled TPM) as we are just looking to see if there is an effect of removing the mitochondrial genes. If we feel we need to go back to FPKM calculations later for better consistency with previous analyses, we can still do that.
@jashapiro We have some TPM UMAPs 😄 Here's the re-rendered notebook exploring the results. The plots don't look wildly different from FPKM which is a nice little sanity check to see. Overall trends look roughly the same with(out) mitochondrial genes. 05-seq-center-mitochondrial-genes.nb.html.zip
For this review, please let* (edit) me know in particular any feedback on how I did the TPM prep (scripts being called from right places, and scratch/
export choice). I also kept the COUNT_THRESHOLD
bash variable for filtering out low counts at 100, which seems more or less fine in the end.
@jashapiro As of now, CI is passing through this module, so it's ready for another look! I updated the module to wholesale remove the remove_mito_genes
flag since it's not needed at all anymore. I also re-ran the module to properly generate the HTMLs with the full data (not the CI data as was there before, woops!). Here’s the notebook:
05-seq-center-mitochondrial-genes.nb.html.zip
Since the tumor purity had a diff (see my last bit here https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/1658#issue-1558238784), I also decided just to be safe to re-run that module as well. All the notebook diffs are HTML miscellany.
This PR explores, as part of manuscript revisions, how dimension reduction might change if we do not include mitochondrial genes, as inspired by this issue: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1601 Specifically, it seems that one sequencing center has some problematic mitochondrial gene expression distributions that may have resulted from using a different kit whose identity may also be lost to the ages.
There are two main changes in this module:
I updated the
transcriptomic-dimension-reduction
module to include an option to remove mitochondrial genes. I added this run for stranded RSEM with log2 (skipping tnse) in order to generate data for this exploration. But, I did not add this run to the 02 plot script , since we don't really need those.I added a notebook to
transcriptomic-dimension-reduction
to explore the mitochondrial expression, first via just looking at FPKMs of mito genes for relevant diagnoses and then via UMAPs made with and without mitochondrial genes. I don't see anything too different here, but welcome thoughts on interpretation. Here's that notebook: 05-seq-center-mitochondrial-genes.nb.html.zipNote that I had some merge conflicts bringing
master
in, so I had to open and clean up some the tumor purity scripts (including re-rendering a notebook), and along the way there VS Code ate lots of EOL spaces, so that's why those diffs are here!