DillonHammill / CytoExploreR

Interactive Cytometry Data Analysis

Review Binned Kernel Density Estimates for Scatterplots #125

Closed DillonHammill closed 2 years ago

DillonHammill commented 2 years ago

The new version of CytoExploreR comes with a plethora of new plotting capabilities, including a shared scale for binned 2D kernel density estimates with a key labelled in counts rather than density.

I have been playing around with various options for adding support for these features and so I thought I would report on some of these in case I need to revisit this somewhere down the track.

The old version of CytoExploreR uses densCols() to assign colours to points, which uses KernSmooth::bkde2D() under the hood. Here is an example of a plot created with base graphics containing 50000 events in the FSC/PE channels with densCols():

image

I will spare all the unnecessary details, but below is how long it takes to compute the BKDE and construct the plot:

   user  system elapsed 
   0.09    0.17    0.26 
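
For anyone wanting to reproduce this kind of timing, here is a minimal sketch. It assumes simulated two-cluster data in place of the real FSC/PE events: the xy matrix, the cluster means and the event count are made up for illustration, so the timings will not match the ones above.

```r
# Simulated stand-in for two cytometry channels (NOT the data from the plots)
set.seed(1)
n <- 5000
xy <- cbind(
  c(rnorm(n / 2, mean = 2), rnorm(n / 2, mean = 5)),
  c(rnorm(n / 2, mean = 2), rnorm(n / 2, mean = 6))
)

# Time the colour assignment; densCols() calls KernSmooth::bkde2D() internally
timing <- system.time({
  cols <- grDevices::densCols(xy, nbin = 128)
})
print(timing)

# One colour per event, ready to pass to a base graphics scatterplot
if (interactive()) {
  plot(xy, col = cols, pch = 16, cex = 0.5, xlab = "FSC", ylab = "PE")
}
```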

Now, in order to add counts to the key we need the binned counts, which are not returned by KernSmooth::bkde2D() but are computed by the internal function KernSmooth:::linbin2D(). Below is the same plot using ks::binning() to get the binned counts, with colours assigned without KDE smoothing:

image

   user  system elapsed 
   0.10    0.14    0.23

It is clear that ks::binning() and KernSmooth::bkde2D() perform similarly in terms of speed, but a significant amount of resolution is lost when performing KDE smoothing with KernSmooth (the top 2 clusters become merged).

OK, so what if we perform KDE smoothing on these counts using ks::kde()? Well, we significantly improve the resolution within the plot but it comes at the cost of speed (probably because binning is performed twice):

image

   user  system elapsed 
   0.56    0.15    0.71 

Alright, so KernSmooth is faster but is lower in resolution and doesn't export the binned counts that we need. KernSmooth:::linbin2D() is lightning fast, so maybe we can write a custom function that exports these counts prior to computing the kernel density. We also need to find a better way to set the bandwidth in KernSmooth to match the resolution of ks.

DillonHammill commented 2 years ago

I should also note that ks::binning() gives almost exactly the same output as KernSmooth:::linbin2D(), but the latter is MUCH faster thanks to its Fortran backend.

image
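
A minimal sketch of pulling the binned counts out of KernSmooth directly is shown below. It relies on the unexported linbin2D(X, gpoints1, gpoints2) helper, so it may break if KernSmooth changes its internals. The simulated xy matrix, the 128 x 128 grid and the 5% padding are assumptions made for illustration only.

```r
# Simulated stand-in for two cytometry channels (NOT real data)
set.seed(1)
n <- 5000
xy <- cbind(
  c(rnorm(n / 2, mean = 2), rnorm(n / 2, mean = 5)),
  c(rnorm(n / 2, mean = 2), rnorm(n / 2, mean = 6))
)

# Grid points padded slightly beyond the data range so that no event
# falls on or outside the outer edge of the grid
grid_size <- c(128L, 128L)
pad <- 0.05 * apply(xy, 2, function(v) diff(range(v)))
gp1 <- seq(min(xy[, 1]) - pad[1], max(xy[, 1]) + pad[1], length.out = grid_size[1])
gp2 <- seq(min(xy[, 2]) - pad[2], max(xy[, 2]) + pad[2], length.out = grid_size[2])

# Linear binning: each event spreads a total weight of 1 over its 4
# neighbouring grid points, so the matrix holds (fractional) counts
counts <- KernSmooth:::linbin2D(xy, gp1, gp2)

# The range of the binned counts is what a count-based key would display
print(range(counts))
```

Because the binning is linear rather than a hard histogram, the counts are fractional, but they still sum to the number of events covered by the grid.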

DillonHammill commented 2 years ago

Alright! The difference in resolution between ks and KernSmooth comes down to the way the bandwidth is computed. ks uses a plug-in approach through Hpi(), which is a 2D generalised version of dpik() in KernSmooth. However, grDevices::densCols() doesn't actually use dpik() to compute the bandwidth passed to bkde2D(). So if we replace the bandwidth with the values computed by dpik(), we get the following plot using KernSmooth::bkde2D():

image

Pretty cool! It looks like the resolution is even better than for ks::kde() and we still get all the speed improvements of sticking to KernSmooth::bkde2D()!
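
The bandwidth swap described above can be sketched roughly as follows. This is an illustration under assumptions, not the CytoExploreR implementation: the xy matrix is simulated, the grid size is arbitrary, and dpik() is simply applied to each channel separately to give one bandwidth per axis.

```r
# Simulated stand-in for two cytometry channels (NOT real data)
set.seed(1)
n <- 5000
xy <- cbind(
  c(rnorm(n / 2, mean = 2), rnorm(n / 2, mean = 5)),
  c(rnorm(n / 2, mean = 2), rnorm(n / 2, mean = 6))
)

# Plug-in bandwidth per channel; dpik() is a 1D selector, so it is
# called once for each column (an assumption of this sketch)
h <- c(KernSmooth::dpik(xy[, 1]), KernSmooth::dpik(xy[, 2]))

# Binned 2D kernel density estimate using the plug-in bandwidths
est <- KernSmooth::bkde2D(xy, bandwidth = h, gridsize = c(128L, 128L))

# densCols() accepts the same bandwidths, so point colours can follow
# the sharper estimate instead of the default bandwidth
cols <- grDevices::densCols(xy, nbin = 128, bandwidth = h)

print(h)
```

bkde2D() returns the grid coordinates in x1 and x2 and the density matrix in fhat, which can be passed to image() or used to look up a colour for each event.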

For the counts, I think we can just import KernSmooth:::linbin2D() and run it when required (for example in .cyto_plot_key_scale() to get the counts for the key) - I doubt this will noticeably slow down plotting, and it is definitely worth the cost to display this information in the key.

These changes will take effect in the next version of CytoExploreR (coming soon). :)