Suggestions to improve user experience

JEFworks commented 3 years ago

As I'm using STdeconvolve on new datasets, here are some enhancements that I believe will help improve the user experience. This is a running list. Please feel free to add and check off as needed.

[x] fitLDA would benefit from a progress bar that shows if verbose=TRUE. See https://github.com/r-lib/progress
[x] vizAllTopics should check if groups is a factor and cast if not (otherwise throws errors)
[x] vizAllTopics would benefit from automatically setting an appropriate r based on the scale of pos unless users override with a specific choice (e.g. max(0.5, max(pos)/nrow(pos)*10) or something similar)
[x] restrictCorpus should limit the number of genes to some set of top most variable genes (similar to how veloviz does it) in the event there are too many features (e.g. > 1000 by default or some other user-specified number)
[x] fitLDA plot needs a legend indicating that the shaded regions are where alpha > 1
[x] progress bar for plotting? (I don't think this is possible with ggplot, so maybe some indication that it is doing something?)
[x] function to rotate the coordinates for plotting (UPDATE: see responses below for quick solution to this)
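
A minimal base R sketch of the two vizAllTopics checks in the list above. The helper names `asGroupFactor` and `defaultRadius` are hypothetical and not part of STdeconvolve, and the radius rule just copies the placeholder heuristic from the list, so it would likely need tuning on real data.

```r
# Hypothetical helpers; not part of the STdeconvolve API.

# Cast `groups` to a factor if it is not one already.
asGroupFactor <- function(groups) {
  if (!is.factor(groups)) {
    groups <- as.factor(groups)
  }
  groups
}

# Pick a scatterpie radius from the scale of `pos` unless the user supplies `r`.
# The formula is the placeholder heuristic from the list above.
defaultRadius <- function(pos, r = NULL) {
  if (!is.null(r)) {
    return(r)
  }
  max(0.5, max(pos) / nrow(pos) * 10)
}

# Toy 50 x 50 grid of pixels spaced 100 units apart.
pos <- as.matrix(expand.grid(x = seq(100, 5000, by = 100),
                             y = seq(100, 5000, by = 100)))
groups <- asGroupFactor(rep(c("a", "b"), length.out = nrow(pos)))
r <- defaultRadius(pos)          # heuristic default
rUser <- defaultRadius(pos, 3)   # user override wins
```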

JEFworks commented 2 years ago

[x] fitLDA plot doesn't render if only one K is provided. Can check if length(Ks)==1 and make one point if that's the case.
```
ldas <- STdeconvolve::fitLDA(t(as.matrix(corpus)), Ks = 2)
```
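
One way to handle the single-K case, sketched in base R graphics rather than whatever plotting code fitLDA uses internally: draw a lone point when only one K is given, and points joined by a line otherwise. `plotPerplexity` is a hypothetical helper and the perplexity values are made-up numbers.

```r
# Hypothetical helper; the perplexity values below are made up for illustration.
plotPerplexity <- function(Ks, perplexities) {
  stopifnot(length(Ks) == length(perplexities))
  # A line needs at least two points, so fall back to a single point.
  plotType <- if (length(Ks) == 1) "p" else "b"
  plot(Ks, perplexities, type = plotType, pch = 19,
       xlab = "K", ylab = "perplexity", xaxt = "n")
  axis(1, at = Ks)
  invisible(plotType)
}

# Draw to a null device when run non-interactively (e.g. via Rscript).
if (!interactive()) pdf(NULL)

singleK <- plotPerplexity(Ks = 2, perplexities = 310)
multiK  <- plotPerplexity(Ks = 2:5, perplexities = c(310, 290, 285, 288))
```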

bmill3r commented 2 years ago

Potential speed ups?

[x] separate the perplexity computation, rare cell type computations, and subsequently the perplexity and rare cell type plot from the LDA fitting (perplexity and rare cell types take time if the K is large)

UPDATE: the rare cell-type computations, perplexity, etc. take a while if they are computed using a new corpus. But because we are interested in the same corpus we fit the LDA model to, we don't need to supply a new corpus. So this speeds things up a little bit.

JPingLin commented 2 years ago

Maybe a progress bar when running vizAllTopics? I was plotting ~3500 spots (per Visium square), and it took more than 5 min for the image to show. At some point I was wondering whether my R had frozen or whether this is the normal behavior.

bmill3r commented 2 years ago

Hi JPingLin,

Thanks for the suggestion! Yes - I have noticed that vizAllTopics can take a while, especially if there are a lot of pixels and cell-types to plot. Most likely this is because of all the individual scatterpie charts ggplot2 has to make. A progress bar would be useful for this purpose, and I will see if I can incorporate something. For now, one workaround is to plot sections of the entire square separately. If you do, make sure the colors for each of the deconvolved cell-types in the theta proportion matrix are explicitly stated in the topicCols parameter, in case a given cell-type is not present in the section of pixels being plotted.
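
A rough sketch of that workaround, using toy data in place of the real pos and theta. The left/right split is an arbitrary choice of sections, and the vizAllTopics call is left commented out so the snippet runs without the package; the argument names follow this thread, so check ?vizAllTopics before relying on them.

```r
# Toy stand-ins for the real inputs: `pos` is pixels x (x, y) and
# `theta` is pixels x cell-types, with matching row names.
set.seed(1)
nPixels <- 200
pos <- cbind(x = runif(nPixels, 0, 100), y = runif(nPixels, 0, 100))
rownames(pos) <- paste0("pixel", seq_len(nPixels))
theta <- matrix(runif(nPixels * 3), ncol = 3,
                dimnames = list(rownames(pos), c("1", "2", "3")))
theta <- theta / rowSums(theta)  # each pixel's proportions sum to 1

# Fix one color per cell-type (same order as the columns of theta) up front,
# so every section uses the same mapping even if a cell-type is absent there.
topicCols <- c("#1b9e77", "#d95f02", "#7570b3")

# Split the pixels into left and right halves.
sectionId <- ifelse(pos[, "x"] < 50, "left", "right")
sections <- split(rownames(pos), sectionId)

for (s in names(sections)) {
  px <- sections[[s]]
  # With STdeconvolve loaded, each section could then be plotted on its own:
  # vizAllTopics(theta = theta[px, ], pos = pos[px, ], topicCols = topicCols, r = 1)
  message(s, ": ", length(px), " pixels")
}
```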

Let me know if you have any other questions or suggestions,
Brendan

JPingLin commented 2 years ago

Hi Brendan, thanks for the great tool, the installation was smooth and error free off the bat! I have one question and one suggestion. For the step "remove genes present in 5% or less of pixel", will this remove genes that are highly specific to a population known to be present in less than 5% of the brain cells? For example, some genes are unique to ependyma/vasculature, and their presence in the sampled brain is lower than 3500*0.05 = 175 spots. Will they get excluded completely? Or maybe my understanding of this step is not correct.

One suggestion: I think it would be useful to incorporate a function to flip the coordinates easily in plots. I know this might be related to whether (0, 0) starts from the upper left or lower left corner of the axes in different programs, and to how the pixel/spot data was initially prepared coming out of a specific platform. Right now the plots are always upside down for me when using Visium output.

bmill3r commented 2 years ago

Hi JPingLin,

Your understanding is correct - by default, genes present in less than 5% of the total pixels in a given dataset will be removed and not included in the final corpus used as input into STdeconvolve. The motivation behind this filtering step is to remove genes that were poorly captured across pixels in the ST experiment and that may not be accurately assigned to clusters of tightly occurring and non-overlapping expressed genes. Depending on the dataset, however, 5% can actually represent a large number of pixels, so a lower threshold may also be appropriate, especially if the goal is to identify and include overdispersed genes that may be marking rare cell-types.

When using restrictCorpus() to filter the counts matrix into the final input corpus, the thresholds on the fraction of pixels can be selected by changing the parameters removeAbove and removeBelow. For example:

```
inputCorpus <- restrictCorpus(counts,
                              removeAbove = 1.0,
                              removeBelow = 0.05)
```

where removeBelow in this case removes genes present in less than 5% of pixels.
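
To make the threshold concrete, here is a small base R illustration of what a 5% cutoff means for a 3500-pixel dataset. It only mimics the idea of the filter (fraction of pixels with a nonzero count) on a toy genes x pixels matrix; that orientation and the exact comparison are assumptions, and this is not the restrictCorpus implementation.

```r
# Toy genes x pixels count matrix (rows = genes, columns = pixels).
set.seed(1)
nPixels <- 3500
counts <- rbind(
  commonGene = rpois(nPixels, lambda = 2),
  rareGene   = c(rep(1, 100), rep(0, nPixels - 100))  # detected in 100 pixels
)

# Fraction of pixels in which each gene has at least one count.
fracDetected <- rowMeans(counts > 0)

removeBelow <- 0.05
cutoffPixels <- nPixels * removeBelow   # number of pixels corresponding to 5%

# rareGene is detected in 100/3500 (about 2.9%) of pixels, so it falls below 5%...
dropAt5pct <- fracDetected < removeBelow

# ...but it would be kept with a 1% threshold.
dropAt1pct <- fracDetected < 0.01
```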

Alternatively, you can also use preprocess to filter the starting counts matrix into the input corpus:

```
preprocess(dat,
           selected.genes = NA,
           nTopGenes = NA,
           genes.to.remove = NA,
           removeAbove = NA,
           removeBelow = NA,
           min.reads = 1,
           min.lib.size = 1,
           min.detected = 1,
           ODgenes = TRUE,
           nTopOD = 1000,
           verbose = TRUE)
```

dat is the pixel (rows) x gene (columns) counts matrix
nTopGenes is the number of top expressed genes to remove
genes.to.remove can be a vector of gene names or patterns to remove
removeAbove and removeBelow are the same as in restrictCorpus()
min.reads is the minimum number of detected reads a gene needs to have to be kept
min.lib.size is the minimum number of reads a pixel needs to have to be kept
min.detected is the minimum number of pixels a gene needs to be detected in to be kept
ODgenes is a flag indicating whether feature selection will use overdispersed genes
nTopOD is the maximum number of top OD genes retained in the final corpus
selected.genes is a vector of gene names to use specifically, if one has a list of genes (marker genes, for example) that one would like to use instead. If this option is used, then removeAbove and removeBelow should be set to NA and ODgenes = FALSE; otherwise these parameters will still be applied to the list of selected genes.

If there is a list of cell marker genes you would like to include in addition to the overdispersed genes, you could first feature select for the overdispersed genes using restrictCorpus, and then apply preprocess to the original counts matrix, using the list of overdispersed genes found via restrictCorpus() plus additional marker genes. For example:

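```
# Step 1: feature select for the overdispersed genes
inputCorpus <- restrictCorpus(counts,
                              removeAbove = 1.0,
                              removeBelow = 0.05)

# Step 2: apply preprocess to the original counts matrix (dat), keeping
# the overdispersed genes from step 1 plus the additional marker genes
inputCorpus <- preprocess(dat,
                          selected.genes = c(rownames(inputCorpus), markerGenes),
                          nTopGenes = NA,
                          genes.to.remove = NA,
                          removeAbove = NA,
                          removeBelow = NA,
                          min.reads = 1,
                          min.lib.size = 1,
                          min.detected = 1,
                          ODgenes = FALSE,
                          nTopOD = 1000,
                          verbose = TRUE)
```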
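
Before combining the two gene lists, it may be worth dropping duplicates and checking that the markers actually exist in the counts matrix. A small sketch with made-up gene names (in practice odGenes would be rownames(inputCorpus) and allGenes the gene names of the original counts matrix):

```r
# Toy stand-ins with made-up gene names.
allGenes    <- c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE")
odGenes     <- c("GeneA", "GeneB")
markerGenes <- c("GeneB", "GeneE", "GeneZ")   # GeneZ is not in the dataset

# Report markers that cannot be selected because they are absent.
missingMarkers <- setdiff(markerGenes, allGenes)
if (length(missingMarkers) > 0) {
  message("Markers not found in the counts matrix: ",
          paste(missingMarkers, collapse = ", "))
}

# union() drops duplicates such as GeneB, which is in both lists.
selectedGenes <- union(odGenes, intersect(markerGenes, allGenes))
```

The resulting selectedGenes vector could then be passed as selected.genes.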

This is a lot of information, so let me know if any of this doesn't make sense or you have additional questions.

bmill3r commented 2 years ago

> One suggestion: I think it would be useful to incorporate a function to flip the coordinates easily in plots. I know this might be related to whether (0, 0) starts from the upper left or lower left corner of the axes in different programs, and to how the pixel/spot data was initially prepared coming out of a specific platform. Right now the plots are always upside down for me when using Visium output.

This is definitely a good idea, and it is most likely an issue with the relative placement of (0,0) with respect to the original image and the plotting coordinate system used in R. I will see if I can come up with a simple function to transform the coordinates. In the meantime, one could do something like this:

```
pos[, "y"] <- pos[, "y"] * -1
```

to essentially flip the plotted pixels upside down. Similarly, you could do the same thing with the x-coordinates of the pixels to mirror them left to right.
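
A slightly more general version of the same idea, as a sketch only: `transformPos` is a hypothetical helper (not part of STdeconvolve) that assumes pos is a matrix with columns "x" and "y". It negates either axis as above and can also rotate the pixels about their centre. Note that flipping a single axis mirrors the tissue, while a rotation does not.

```r
# Hypothetical helper, not part of STdeconvolve.
# `pos` is a pixels x 2 matrix with columns "x" and "y".
transformPos <- function(pos, flipX = FALSE, flipY = FALSE, degrees = 0) {
  out <- pos
  if (flipX) out[, "x"] <- -out[, "x"]
  if (flipY) out[, "y"] <- -out[, "y"]
  if (degrees != 0) {
    # Counter-clockwise rotation about the centre of the pixels.
    a <- degrees * pi / 180
    rot <- matrix(c(cos(a), sin(a), -sin(a), cos(a)), nrow = 2)
    xy <- out[, c("x", "y"), drop = FALSE]
    centre <- colMeans(xy)
    centred <- sweep(xy, 2, centre)
    out[, c("x", "y")] <- sweep(centred %*% t(rot), 2, centre, "+")
  }
  out
}

pos <- cbind(x = c(0, 2, 2, 0), y = c(0, 0, 1, 1))
flipped <- transformPos(pos, flipY = TRUE)   # same as pos[, "y"] * -1
rotated <- transformPos(pos, degrees = 90)   # quarter turn, counter-clockwise
```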

JEFworks-Lab / STdeconvolve

Suggestions to improve user experience #1