RGLab / openCyto

A package that provides data analysis pipeline for flow cytometry.
GNU Affero General Public License v3.0
77 stars 29 forks source link

Add support for empty dims in gatingTemplate #234

Closed DillonHammill closed 2 years ago

DillonHammill commented 2 years ago

Hi @mikejiang,

I have taken the time to look through the openCyto code base to better add support for my empty dims use case as described in my previous PR #229. I am busy implementing some new CytoExploreR plugins that require this functionality.

The code is well safeguarded against the empty dims case as it is, but I have strengthened this a bit to address your concerns regarding parsing of complex pop (-/+ etc.). As I mentioned previously, for the empty dims case we would never want to use complicated pop entries, it simply doesn't make any sense to use anything other than pop=+ for this case. Below is summary of the changes that I have made:

  1. I converted the empty dims error that I added in my last PR (code chunk below) into a warning instead to allow the gating process to proceed whilst at the same time warning users. I doubt that anyone other than me will attempt to do this, but at least now there is an appropriate warning when dims have been omitted by accident. https://github.com/RGLab/openCyto/blob/8c438714e9ded55058786e3f0f6a1b0015304941/R/gating-methods.R#L264-L272

  2. I have updated gh_generate_template() to record empty dims as NA in the gatingTemplate, the current implementation (below) returns an empty string. I noticed this a while back in CytoExploreR, I think it safer to have empty dims set to NA (consistent with other gatingTemplate parameters), otherwise I have to perform an separate check for empty strings for dims. I haven't checked it this affects other gating functions yet I will take a closer look tomorrow. Happy to remove this change if you think it is going to cause issues somewhere else. https://github.com/RGLab/openCyto/blob/8c438714e9ded55058786e3f0f6a1b0015304941/R/csvTemplate-parser.R#L26

  3. I have updated .preprocess_row() to handle the new NA dims case above and I have removed the dim_count check in the CSV parsing function. Perhaps you will have a better way to keep this check in place. https://github.com/RGLab/openCyto/blob/8c438714e9ded55058786e3f0f6a1b0015304941/R/csvTemplate-parser.R#L290-L295

  4. In terms of parsing pop, I have put a safeguard in place for the pop=-/+ case to throw an error when attempting to parse complex pop when dims are empty. I haven't added this to the other pop cases as they are covered by .splitTerms which performs it own dims length check. https://github.com/DillonHammill/openCyto/blob/476560e2651d57ff8b3a00d82affcdf316017e8b/R/csvTemplate-parser.R#L339-L342

Happy to discuss this further and amend as necessary - this is something that I definitely want supported for CytoExploreR. I doubt any openCyto users will actually attempt to this themselves and CytoExploreR users don't interact directly with the gatingTemplate anyway (I take care of that for them). I would just like to have an implementation that I can use even if no-one else uses it.

Thanks for your help.

Dillon

DillonHammill commented 2 years ago

I have also added an informative error message to the gatingTemplate constructor when the supplied gatingTemplate (CSV or data.table) is empty.

DillonHammill commented 2 years ago

I have also moved the stop.at break in the gt_gating() for loop to make sure that the gating process is terminated after the stop.at node is gated.

See example below. At the moment it is not possible to use stop.at gate up to (and including) Live Cells because the stop.at node itself is not gated. If you try and descend down the tree and set stop.at = "Dendritic Cells" then the T Cells are still gated. image

However, if we delay the break in the gt_gating() for loop until after the stop.at node has been gated we are able to successfully gate up until (and including) the Live Cells.

mikejiang commented 2 years ago

Your change is a little intrusive, especially the removal the validity check of dim_count makes me somewhat uncomfortable. I am trying to understand your use case in order to come up a better a solution for you, is this following API call what we are aiming for?

  gs_add_gating_method(gs, alias = "*", parent = "root", pop = "*", dims = "", gating_method = "clustering_method_A")

If so, I can simply conditionally do sub-setting on data, that is

parent_data <- parent_data[, channels]
DillonHammill commented 2 years ago

Thanks for looking into this @mikejiang. Yeah unfortunately I couldn't think of a better way of getting around that check, even if you bypass the subsetting for empty dims (mentioned in #229) the gating terminates at that check since the gating_method is not either polyFunctions, boolGate or refGate. I thought of potentially flipping this on its head and instead check for methods that require 1 or 2 dims but I think this list would be too extensive.

Yeah I aiming for API call similar to the one you mentioned where I pass the dims through gating_args instead and pass that to the gating function directly for subsetting prior to clustering. I think my reasoning for this was due to there being a 1 or 2 dim length restriction somewhere.

After thinking about this a bit more, maybe it would be a better idea to instead allow more than 2 dims then I could just pass them through there directly and the subsetting step no longer becomes an issue. It also means that we can keep that check (that I agree) we should keep. What do you think about this approach? I can fix up my PR to instead do things this way if you think it is a better way. I will take another look tomorrow and have a go at implementing things this way.

mikejiang commented 2 years ago

No, much of the code base was designed for 1D or 2D gating, thus extending to more than 2D requires significant non-trivial changes, however currently there is special treatment for the use case of alias= and pop = , which is reserved for clustering algorithm that returns a factor or logical vectors representing multiple cell populations. What does your gating function returns exactly?How do you name these populations? Through alias column? Can you give me the exact example gating function specs in terms input and output? so I can work on the solution toward that direction

DillonHammill commented 2 years ago

OK, I have a couple of different gating methods that don't adhere to the 1D/2D gating paradigm.

The first is simply a gating method that takes on a node and returns a logical vector indicating the events to keep in the sampled population (simply a samplibg gating method). In this case, dims are irrelevant as the gating method just calls nrow on the flowFrame and computes a sample size. This is where I would like to keep dims empty, otherwise it looks like the node has been sampled based on whatever channel(s) I put in dims (which is not the case).

The second case fits with what you have mentioned in your previous comment. In this case, the gating method takes on a multidimensional flowFrame (i.e. minimum 1 dim and maximum many dims) performs multidimensional clustering and returns a factor with default names for new populations stored in levels. For supervised clustering where the number of clusters is known, the names can be passed to alias directly, but for unsupervised clustering where any number of clusters is possible I instead pass default names through the factor levels and set alias=*. At the moment I pass the dims to the gating function through gating_args and would like to leave dims empty so that the data isn't subsetted to 1 or 2 dims before getting passed to the gating function.

Thanks again for your help.

mikejiang commented 2 years ago

see https://github.com/RGLab/openCyto/pull/235