RGLab / openCyto

A package that provides data analysis pipeline for flow cytometry.
GNU Affero General Public License v3.0
75 stars 29 forks source link

Support empty dims #235

Closed mikejiang closed 2 years ago

mikejiang commented 2 years ago

@DillonHammill I included three test cases https://github.com/RGLab/openCyto/blob/support-empty-dims/tests/testthat/test-clustering-gating.R that covers the use case you stated here https://github.com/RGLab/openCyto/pull/234#issuecomment-1023533316 Let me know if this works for you

DillonHammill commented 2 years ago

Thanks so much for taking the time to put these changes together @mikejiang. As always, I really appreciate your help.

I have been testing using empty dims for the sampling gating method, but it only seems to work when the GatingSet contains no other gates - see example code below:

library(openCyto)
library(flowWorkspace)
gs <- GatingSet(GvHD[1])

#create a dummy gating function that mimic sampling method
dummy_func <- function(fr, pp_res, channels, ...)
{
  n = nrow(fr)
  res = rep(TRUE,n)
  return (res)
}
#register it as the plugin gating method
register_plugins(dummy_func, "sample_method")
# Add a gate
gs_pop_add(
  gs,
  parent = "root",
  gate = rectangleGate(
    "FSC-H"=c(225,500), 
    "SSC-H"=c(15, 150),
    filterId="test"
  )
)
#apply it to the gs
gs_add_gating_method(
  gs, 
  alias = "A", 
  parent = "root", 
  pop = "+", 
  dims = "", 
  gating_method = "sample_method"
)

If I run all the code below I get an error at the check we have spoken about previously

Error in .preprocess_row(this_row, strict = strict) : 
  + has invalid number of dimensions: 0 

However, if I don't add the test gate to the GatingSet prior to gating everything works just fine. I will debug within openCyto and see if I can help track down the problem.

I suspect I will encounter the same issue when I try to testing clustering gating methods as well.

DillonHammill commented 2 years ago

OK, so the issue arises due to single line data.table (i.e. no previous gates) converting the empty dims to "NA" in .preprocess_csv():

  for(col in colnames(dt))
  {
    if(col != "collapseDataForGating")
    {
      newCol <- dt[[col]]
      dt[, (col):= as.character(newCol)]
    }
  }

This is passed to preprocess_row() where the dim_count check is performed:

dim_count <- length(strsplit(split = ",", dims)[[1]])

The above returns 1 when dims = "NA" thus bypassing this check and allowing the code to proceed.

On the other hand, if there are other gates defined in the template, dims is converted to "" as we would expect and dim_count = 0 but this causes the dim_count check to fail due to the restriction on allowed gating methods here:

  if (!dim_count %in% c(1, 2)) {
    if (!(dim_count == 0 && gm %in% c("polyFunctions", "boolGate", "refGate"))) {
      stop(popName, " has invalid number of dimensions: ", dim_count)
    }
  }
mikejiang commented 2 years ago

I have pushed the fix try again

DillonHammill commented 2 years ago

Thanks @mikejiang, looks like it the sampling gating method is working now. I will test the clustering gating method today. I will let you know if I encounter any issues.

Once these changes have been made, can I submit a new PR with the other changes that I would like to make (I will close the old PR to save me some time)?

Please review the proposed additional changes below and let me know if there are any that you don't want to make:

mikejiang commented 2 years ago

I'm not sure about the first one, the rest two looks ok

DillonHammill commented 2 years ago

@mikejiang, could we also update your changes to .preprocess_row() to cover the NA dims case as well, since this is the default in gs_add_gating_method()? At least then if you don't supply any dims to gs_add_gating_method() gating can still proceed. The below change to .preprocess_row() works well in testing:

  if(popName == "*" ){
    if(alias == "*"|| dims %in% c("", NA))
      return(this_row)
    else
      return(.gen_dummy_ref_gate(this_row))
    }
DillonHammill commented 2 years ago

@mikejiang, the first change is more for consistency with the other columns in the gatingTemplate that return NA for empty entries - dims seems to be unique in that it returns "" instead. I can easily work around this on my end, the other changes are far more important.

DillonHammill commented 2 years ago

Hi @mikejiang, one additional change I would like to make is to get the gating to work properly when groupBy=NA and collapseDataForGating=TRUE. At the moment, each sample is placed in its own group, when instead all the samples should be collapsed into the same group. I have add to add a dummy variable to pData() to get around this up until now but I have taken a look at the code and it is just missing an else if statement (gating-methods.R line 276 onwards):

    if (groupBy != "" && isCollapse) {
      #when x@collapse == FALSE,then ignore groupBy argument since grouping is only used for collapsed gating
      split_by <- as.character(groupBy)
      split_by_num <- as.numeric(split_by)
      #split by every N samples
      if(!is.na(split_by_num)){
            nSamples <- length(parent_data)
            if(nSamples==1){
              split_by <- 1
            }else{
              split_by <-  sample(rep(1:nSamples, each = split_by_num, length.out= nSamples))  
            }

      }else{
        #split by study variables
        split_by <- strsplit(split_by, ":")[[1]]
        split_by <- apply(pData(parent_data)[, split_by, drop = FALSE], 1, paste, collapse = ":")
        split_by <- as.character(split_by)
      }
    } else if(groupBy == "" && isCollapse) { # ADDED THIS PART
      split_by <- rep("all", length(parent_data))
    } else {
      split_by <- sampleNames(parent_data)
    }
    fslist <- split(parent_data, split_by)

Since the names of fslist is required to order pp_res we will need to have a standard name for the group in this case, I have just called it "all" here to match the behaviour in CytoExploreR - but any name would do.

In testing, this fixes the issue I was having. Please let me know if this change is OK (also what name to give the group containing all samples) and I will include it in my new PR once your changes here have been pushed to master.

I have also been testing the clustering gating method using factors and it seems to work well. I was just wondering if there is a way to bypass the default naming of the factor levels? I take care of this on my end so I don't need to prefix the factor levels with the name of the gating method. Perhaps I can avoid this by using a multipleFilterResult instead? I will give this a try today.

mikejiang commented 2 years ago

Can you confirm this PR works as expected so that I can merge this first?

DillonHammill commented 2 years ago

Sure no problem, I am just testing the FilterResult method to see if that gets around the default factor level naming. I will report back soon.

DillonHammill commented 2 years ago

Thanks @mikjiang, looks like this PR is now working as expected - please close my original PR and I will prepare a new PR with additional changes I would like to make once you have pushed these changes to master.