PseudobulkingBarPlot improvements

bimberlabinternal / CellMembrane

An R package with wrappers and pipelines for single cell RNA-seq analysis

PseudobulkingBarPlot improvements #237

Closed GWMcElfresh closed 6 months ago

GWMcElfresh commented 7 months ago

Hi everyone,

I'm testing the pseudobulking pipeline more thoroughly ~~and I caught a bug concerning log transformations for down regulated genes (negative number)~~ and supporting contrast directionality swapping in the PseudobulkingBarPlot.

Let me know if you can think of any other improvements that might be helpful!

-GW

GWMcElfresh commented 7 months ago

Ah, as it turns out, stacked bars don't play nicely with log scales. Perhaps not worth supporting. I'll remove this in a commit on Monday.

For example, for log transformations the reference point is 1. In fact, when using a log scale, geom_bar() automatically places the base of the bar at 1. Furthermore, never use stacked bars with a transformed scale, because scaling happens before stacking. As a consequence, the height of bars will be wrong when stacking occurs with a transformed scale.

https://ggplot2.tidyverse.org/reference/geom_bar.html
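To make the quoted warning concrete, here is a minimal base R sketch of what "scaling happens before stacking" does to bar heights. The segment names and counts are hypothetical and are not taken from PseudobulkingBarPlot output.

```r
# Hypothetical counts for two segments stacked in one bar (illustration only).
counts <- c(segment_a = 100, segment_b = 900)

# What a reader expects on a log10 axis: the top of the stack at log10(total).
expected_top <- log10(sum(counts))   # log10(1000)

# What transform-then-stack produces: each segment is logged first,
# and the logged heights are then added together.
stacked_top <- sum(log10(counts))    # log10(100) + log10(900)

# Read back on the original scale, the stacked bar implies the product
# of the segments rather than their sum.
implied_total <- 10^stacked_top

print(c(expected = sum(counts), implied = implied_total))
```

Because the sum of logs is the log of the product, the stacked bar overstates the total whenever more than one segment is larger than 1.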

GWMcElfresh commented 7 months ago

This last commit was a quick fix to FitRegularizedClassificationGlm(). NormalizeAndScale() now infers the nCount_Assay variable to regress out. 'nCount' metadata fields don't exist after pseudobulking, and we probably don't want to regress on them anyway, so I fixed the variable to regress out to always be NULL.

GWMcElfresh commented 7 months ago

Thinking about this: the point of the log axis would be to improve the relative legibility within each facet. If we let the y scale be free during faceting, each facet only varies by 1-2 orders of magnitude. I think this accomplishes the primary goal anyway.

Also, this doesn't work for some reason. geom_bar() is cool for the stacked bars in the positive and negative directions, but it doesn't play nicely with any transformations, so I'm going to just remove the y-axis related arguments.
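For reference, the free y scale idea from the first paragraph of this comment can be sketched with plain ggplot2 as below. This is an illustration under assumptions, not the package's plotting code: the data frame, its column names (cell_type, contrast, direction, n_genes), and the values are all hypothetical.

```r
library(ggplot2)

# Hypothetical summary: signed DEG counts per contrast and direction,
# with one facet (cell_type) roughly two orders of magnitude larger.
deg_counts <- data.frame(
  cell_type = rep(c("TypeA", "TypeB"), each = 4),
  contrast  = rep(rep(c("c1", "c2"), each = 2), times = 2),
  direction = rep(c("up", "down"), times = 4),
  n_genes   = c(12, -8, 30, -21, 1500, -900, 2400, -1800)
)

# Untransformed y axis; each facet gets its own y range via scales = "free_y",
# so small facets stay legible without a log transformation.
p <- ggplot(deg_counts, aes(x = contrast, y = n_genes, fill = direction)) +
  geom_col() +
  facet_wrap(~ cell_type, scales = "free_y")

print(p)
```

Positive and negative values stack away from zero in their own directions, so no scale transformation is involved and the bar heights stay on the original count scale.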

GWMcElfresh commented 7 months ago

This last commit removes the legacy option for the QLF. The main purpose here is to expose leverage as a statistic for the GLM so we can support some kind of gene-level outlier detection.
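As a rough illustration of what leverage measures, the sketch below fits a Poisson GLM to one hypothetical gene with base R and reads the hat values. It does not show how the package or its QLF fit exposes leverage; the counts, group labels, and the 2x-mean cutoff are assumptions chosen for the example.

```r
# Hypothetical pseudobulk counts for a single gene across six samples.
counts <- c(10, 12, 9, 11, 13, 40)
group  <- factor(c(rep("control", 5), "treated"))

# One-factor Poisson GLM fitted with base stats (an assumption for illustration).
fit <- glm(counts ~ group, family = poisson())

# Leverage is the diagonal of the hat matrix; hatvalues() works on glm fits.
lev <- hatvalues(fit)

# A common rule of thumb: flag observations above twice the mean leverage.
flagged <- which(lev > 2 * mean(lev))

print(round(lev, 3))
print(flagged)
```

The lone "treated" sample determines its own fitted value completely, so it carries far more leverage than any control sample. This is the kind of signal a leverage filter could act on.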

GWMcElfresh commented 6 months ago

Just some diligence here:

2022411 fixes a subsetting issue with metadata fields containing an underscore that had not been passed through gsub("_", ".", field).

155a24f overhauls the bar plot dataframe to store information much closer to the data actually displayed in the bar plot (1 row per contrast with magnitudes of differentially expressed genes), as opposed to being more like a slightly-pruned-but-still-very-long result from RunFilteredContrasts().
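The underscore issue addressed in 2022411 can be sketched in base R as follows. The field names and the data frame are hypothetical; the only thing taken from the thread is the gsub("_", ".", field) substitution.

```r
# Hypothetical metadata field names requested by the user.
fields <- c("cell_type", "Subject_Id", "Timepoint")

# Hypothetical table whose columns were already renamed with dots.
design <- data.frame(cell.type = "T", Subject.Id = "s1", Timepoint = "d0")

# Looking up the raw names misses every field that contains an underscore.
missing_raw <- setdiff(fields, colnames(design))

# Applying the same substitution to the requested names restores the match.
sanitized <- gsub("_", ".", fields, fixed = TRUE)
missing_sanitized <- setdiff(sanitized, colnames(design))

print(missing_raw)
print(sanitized)
```

Using fixed = TRUE keeps the pattern literal, which makes no difference for an underscore but guards against surprises if the separator ever changes to a regex metacharacter.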

Up next is DDE, recording logCPM on each side of the contrast, leverage filtering, and probably some more gene set comparisons. Maybe something like a cross-validated classifier to see how well the DEGs do at classifying a phenotype, like what Paul & co have done with their DDE?