DillonHammill / CytoExploreR

Interactive Cytometry Data Analysis

Support custom statistics in cyto_stats_compute() #69

Closed DillonHammill closed 3 years ago

DillonHammill commented 4 years ago

In line with the comments on #32, I will be making some significant changes to cyto_stats_compute() to allow the use of custom statistical functions. This is still a work in progress, but it is likely that custom functions will need to operate at the flowFrame/cytoframe level instead of at the level of the raw data as described in #32. This is because it is often desirable to have more control over when transformations are applied (i.e. some statistics are computed on the transformed scale and others are not), and the majority of statistics within cyto_plot() are calculated at the flowFrame/cytoframe level. To handle these preprocessing steps, I have added a new cyto_data_preprocess() function that performs all the steps required to get the raw data matrix from a flowFrame (including copying, gating, transformations and channel restriction). This means that custom statistical functions would take the following form (here is a custom mean function):

# x is a flowFrame
custom_mean_func <- function(x,
                             channels = NULL,
                             trans = NA,
                             gate = NA,
                             inverse = TRUE){

 # PREPROCESS TO MATRIX
 x <- cyto_data_preprocess(x,
                           channels = channels,
                           trans = trans,
                           gate = gate,
                           inverse = inverse)

 # CALCULATE STATISTIC
 apply(x, 2, mean)
}

This also means that if you want to perform your own custom preprocessing you don't need to use cyto_data_preprocess() within your function - all the essential arguments will be available should you need them. You can also add any additional arguments required to compute your custom statistic. Once you have written your custom function you can then use it directly in cyto_stats_compute() and cyto_plot() as well:

cyto_stats_compute(gs,
                   alias = "CD4 T Cells",
                   stat = "custom_mean_func",
                   channels = c("CD44", "CD69"))

cyto_plot(gs,
          parent = "CD4 T Cells",
          channels = "CD69",
          label_stat = "custom_mean_func")

The name of the function will be used in exported tables, so make sure that you give it a descriptive name. Note also that your function should return a statistic for each of the channels; using apply() as above is a succinct way to do this.
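As a minimal sketch of the per-channel requirement and of passing an additional argument, the example below works on a plain numeric matrix standing in for the output of cyto_data_preprocess() (events in rows, channels in columns). The function name custom_trimmed_mean, the trim argument and the simulated values are assumptions made for illustration only; they are not part of CytoExploreR.

```r
# Simulated stand-in for a preprocessed data matrix:
# rows are events, columns are channels (values are made up).
set.seed(1)
x <- cbind("CD44" = rnorm(100, mean = 500, sd = 50),
           "CD69" = rnorm(100, mean = 200, sd = 20))

# Hypothetical custom statistic with one additional argument (trim).
# apply() over columns returns one named value per channel.
custom_trimmed_mean <- function(x, trim = 0.1) {
  apply(x, 2, mean, trim = trim)
}

res <- custom_trimmed_mean(x, trim = 0.05)
print(res)
```

The result is a named numeric vector with one entry per column, which is the shape a single-statistic function is expected to return here.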

DillonHammill commented 4 years ago

The above implementation expects a single statistic per parameter but that may not always be the case. Here is the definition of a custom quantile function (quantiles are now natively supported - this is just an example):

cyto_quantile <- function(x,
                          channels = NULL,
                          trans = NA,
                          gate = NA,
                          inverse = TRUE,
                          ...){

 # PREPROCESS
 x <- cyto_data_preprocess(x,
                           channels = channels,
                           trans = trans,
                           gate = gate,
                           inverse = inverse)

 # QUANTILES
 apply(x, 2, quantile, ...)
}

We can now call this function within cyto_stats_compute() to calculate the 5th and 95th percentiles:

cyto_stats_compute(gs,
                   alias = "CD4 T Cells",
                   stat = "cyto_quantile",
                   channels = c("CD44", "CD69"),
                   probs = c(0.05, 0.95))

Running the cyto_quantile() function returns a matrix with one column per channel and one row per quantile:

    Alexa Fluor 647-A 7-AAD-A
5%              21.78 -172.53
95%          31910.34 5379.21

It would be nice if this was handled appropriately by cyto_stats_compute() as well. In this case the rownames have been set, so we can use them as an additional variable in the exported data. Perhaps I could add a new variable named after the function used and populate it with these rownames; that way we can calculate any number of statistics per parameter. The only requirement would be that there are rownames. Where rownames have not been set, we will resort to populating this variable with the function name and the row number (e.g. cyto_quantile_1, cyto_quantile_2, cyto_quantile_3 and so on).
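The proposal above could look roughly like the following base R sketch. It is only an illustration of the idea, not how cyto_stats_compute() formats its output: the helper name stat_matrix_to_long and the column names channel and value are hypothetical, and the matrix simply reuses the quantile output shown earlier.

```r
# Hypothetical helper: convert a statistic-by-channel matrix to long format.
# 'stat_name' is the name of the function that produced 'm'.
stat_matrix_to_long <- function(m, stat_name) {
  # Fall back to <function name>_<row number> when rownames are missing
  levels <- rownames(m)
  if (is.null(levels)) {
    levels <- paste(stat_name, seq_len(nrow(m)), sep = "_")
  }
  out <- data.frame(
    channel = rep(colnames(m), each = nrow(m)),
    level = rep(levels, times = ncol(m)),
    value = as.vector(m),  # column-major, matches the rep() calls above
    stringsAsFactors = FALSE
  )
  # Name the level column after the function that was used
  names(out)[names(out) == "level"] <- stat_name
  out
}

# Quantile matrix from the example output (rows: quantiles, columns: channels)
m <- matrix(c(21.78, 31910.34, -172.53, 5379.21), nrow = 2,
            dimnames = list(c("5%", "95%"),
                            c("Alexa Fluor 647-A", "7-AAD-A")))
long <- stat_matrix_to_long(m, "cyto_quantile")

# Same matrix without rownames, to exercise the fallback naming
m2 <- m
rownames(m2) <- NULL
unnamed <- stat_matrix_to_long(m2, "cyto_quantile")

print(long)
print(unnamed)
```

Storing the row labels in a column named after the function keeps one row per channel and statistic level, so any number of statistics per parameter fits in the same table.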

Multilevel statistics are not currently supported in cyto_plot() and it may remain that way, but at least you don't have to run cyto_stats_compute() multiple times to compute multiple statistics.

DillonHammill commented 4 years ago

These data processing steps and the function dispatch can now be handled directly within cyto_apply() as described in #72. All the data formatting steps will need to be performed within cyto_stats_compute() after cyto_apply() has been called. Working on this now...

DillonHammill commented 3 years ago

The statistical suite of functions has been completely revamped in CytoExploreR version 2.0.0.

As described above, we now use cyto_apply() to handle applying functions to cytometry objects (see the awesome input argument for details), and we have a suite of internal cyto_stat_() functions which compute statistics over vectors or matrices.

Users can now pass ANY function through the stat argument and indicate the required input format for the data passed to this function (cytoframe, cytoset, matrix, column or row) and cyto_apply() will take care of the rest.
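To illustrate what declaring an input format means in practice, here is a small base R sketch. It is not the cyto_apply() implementation and does not use its signature: apply_by_input is a hypothetical helper, it covers only the matrix, column and row formats, and it operates on a plain matrix rather than a cytometry object.

```r
# Hypothetical dispatcher, NOT the cyto_apply() implementation:
# hands the data to FUN whole, column-wise, or row-wise.
apply_by_input <- function(x, FUN, input = c("matrix", "column", "row"), ...) {
  input <- match.arg(input)
  FUN <- match.fun(FUN)  # accepts a function or its name as a string
  switch(input,
         matrix = FUN(x, ...),
         column = apply(x, 2, FUN, ...),
         row = apply(x, 1, FUN, ...))
}

# Made-up data: rows are events, columns are channels
x <- cbind("CD44" = c(1, 2, 3, 4), "CD69" = c(10, 20, 30, 40))

by_column <- apply_by_input(x, "median", input = "column")  # one value per channel
whole <- apply_by_input(x, nrow, input = "matrix")          # FUN sees the whole matrix
by_row <- apply_by_input(x, sum, input = "row")             # one value per event
```

The point of the design is that the statistic itself stays a plain function of a vector or matrix, while the caller states which shape of data that function expects.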