alanocallaghan / scater

Clone of the Bioconductor repository for the scater package.
https://bioconductor.org/packages/devel/bioc/html/scater.html

Use scattermore to speed up plotColData and plotReducedDim for large datasets #191

Closed lambdamoses closed 1 year ago

lambdamoses commented 1 year ago

Hi,

I have used the scattermore package in my Bioconductor package Voyager with great success when plotting datasets with hundreds of thousands of cells. Plotting almost 400,000 MERFISH cell centroids in histological space as points took about 10 seconds on our server, while doing the same with scattermore took only about 2 seconds, without significant visual differences. While 10 seconds doesn't sound too bad for one panel, scattermore helps greatly when making multi-panel plots for multiple genes. I know that your package uses ggrastr so that we can avoid PDF woes when plotting lots of cells. However, ggrastr doesn't seem to speed up plotting: it took around 11 seconds to plot the same 400,000 cell centroids. scattermore also rasterizes the scatter plot without rasterizing the labels.

It would be great if you could add an option to use scattermore in plotColData and plotReducedDim for large datasets. It would especially help in the PCA matrix plot and when plotting multiple genes, where the cells are plotted in multiple panels. Also, while the cells don't overlap in histological space, they usually pile up a lot in PCA and UMAP, so using scattermore can bring an even larger speed boost.
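To make the comparison concrete, here is a minimal sketch in plain ggplot2, not in scater itself. The data are simulated stand-ins (the MERFISH dataset is not reproduced here), the column names `x`, `y` and `total` are made up for illustration, and the `pointsize` and `pixels` values are arbitrary. How the option would be exposed in plotColData and plotReducedDim is left open.

```r
library(ggplot2)
library(scattermore)

set.seed(1)
n <- 1e5  # smaller than the ~400,000 cells mentioned above, to keep the demo quick

# Simulated stand-in for cell coordinates and a per-cell value (assumption)
df <- data.frame(x = rnorm(n), y = rnorm(n), total = rpois(n, 50))

# Ordinary vector points: one graphical object per cell
p_point <- ggplot(df, aes(x, y, colour = total)) +
    geom_point(size = 0.1)

# scattermore: only the point layer is drawn as a bitmap;
# axes, labels and legend are still drawn as usual
p_raster <- ggplot(df, aes(x, y, colour = total)) +
    geom_scattermore(pointsize = 1, pixels = c(512, 512))

# Rough render timings; the numbers depend on the machine and graphics device
system.time(print(p_point))
system.time(print(p_raster))
```

Timings from a sketch like this will not match the figures quoted above, since the data, device and hardware differ.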

Shall I pull request?

lambdamoses commented 1 year ago

Also related: when there are too many cells and overplotting is a huge issue, how about using ggplot2::stat_summary_2d() in plotColData and plotReducedDim to bin the cells and summarize each bin with a user-supplied function, such as sum or mean?

alanocallaghan commented 1 year ago

Sounds good

alanocallaghan commented 1 year ago

Not too sure about the stat2d thing, can you maybe hack up a quick example before PRing?

lambdamoses commented 1 year ago

Example of stat2d: when there are lots of cells, so many that they are hard to see when plotted in space, we may bin them in space. If you are plotting a feature of the cells, say total UMI count, then you can compute the sum of the UMI counts in each bin and color the bins by those sums. The summary can also be the mean or median. An example in histological space: https://pachterlab.github.io/voyager/articles/vig3_slideseq_v2.html#quality-control-qc (scroll down a bit to see the plot).
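A minimal sketch of that binning in plain ggplot2, assuming simulated data: the `cells` data frame, its column names, and the choice of 50 bins per axis are all made up for illustration and are not taken from the vignette linked above.

```r
library(ggplot2)

set.seed(1)
n <- 1e5

# Hypothetical cell centroids and per-cell total UMI counts (simulated)
cells <- data.frame(
    x = runif(n, 0, 100),
    y = runif(n, 0, 100),
    umi = rpois(n, 200)
)

# Bin the cells on a 2D grid; `z` is the per-cell value to summarize,
# and `fun` is applied to the z values of the cells falling in each bin
p_binned <- ggplot(cells, aes(x, y, z = umi)) +
    stat_summary_2d(fun = sum, bins = 50) +
    coord_equal() +
    labs(fill = "Sum of UMI counts")

print(p_binned)
```

Swapping `fun = sum` for `fun = mean` or `fun = median` gives the other summaries mentioned above. Each bin is drawn as a single tile, so the number of objects drawn depends on the grid size rather than on the number of cells.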