RGLab / openCyto

A package that provides data analysis pipeline for flow cytometry.
GNU Affero General Public License v3.0
77 stars 29 forks source link

Problem with start/stop.at argument in gt_gating function #239

Closed htanaka1234 closed 2 years ago

htanaka1234 commented 2 years ago

I ran into a problem while working on bypassing Issue #238. When I specify start/stop.at in the gt_gating function, I get a "Can't find [node]" error. When using a complex gating template and specifying start/stop.at in the gt_gating function, I sometimes get a "reference node not found" error. I am guessing that this error is caused by using different node list variables, the order of the nodes often differs between the two variables, when getting the index of the "start" node from "nodePaths" variable and when using this index to specify the node list to be gated from "gt_nodes" variable in the .gating_gatingTemplate function.

.gating_gatingTemplate <- function(x, y, env_fct = NULL, start = "root", stop.at = NULL, keep.helperGates = TRUE, ...) {
  gt <- x
  if (!is.null(env_fct)) {
    # use the fcTree if already exists
    if (exists("fct", env_fct)) {
      fct <- get("fct", env_fct)
    } else {
      # create one from gt if not
      fct <- fcTree(gt)
      assign("fct", fct, env_fct)
    }

  }

  nodePaths <- names(sapply(gt_get_nodes(gt),alias))
    #validity check for stop.at argument
  if(!is.null(stop.at)){
    #escape meta character
    stop.pat <- gsub("\\+", "\\\\+", stop.at)
    stop.pat <- paste0(stop.pat, "$")
    if(substr(stop.at, 1,1) != "/")
      stop.pat <- paste0("/", stop.pat) #prepend path delimiter to avoid partial node name matching

    matchInd <- grep(stop.pat, nodePaths)
    if(length(matchInd) == 0)
      stop("Can't find stop point: ", stop.at)
    else if(length(matchInd) > 1)
      stop("ambiguous stop point: ", stop.at)
  }
  # gate each node 
  **gt_nodes <- tsort(gt)[-1]#by the topological order**

  #try to skip some nodes to save time
  if(start != "root"){
    #escape meta character
    start.pat <- gsub("\\+", "\\\\+", start)
    start.pat <- paste0(start.pat, "$") #treat it as terminal node
    if(substr(start, 1,1) != "/")
      start.pat <- paste0("/", start.pat) #prepend path delimiter to avoid partial node name matching
    matchInd <- grep(start.pat, nodePaths)

    if(length(matchInd) == 0)
      stop("Can't find start point: ", start)
    else if(length(matchInd) > 1)
      stop("ambiguous start point: ", start)

    **gt_nodes <- gt_nodes[matchInd:length(gt_nodes)]**
  }  
mikejiang commented 2 years ago

Can you provide a reproducibe example?

htanaka1234 commented 2 years ago

Below is a minimal code that reproduces the problem. One correction: stop.at works as expected, but only start behaves differently than expected.

testGatingTemplate.csv

library(dplyr)
library(data.table)
library(flowCore)
library(flowWorkspace)
library(openCyto)

dataDir <- system.file("extdata",package="flowWorkspaceData")
files <- list.files(dataDir, pattern = ".fcs$",full = TRUE)

fs <- read.flowSet(files[1])

compMat <- spillover(fs[[1]]) %>% compensation
fs_comp <- compensate(fs, compMat)
chnls <- parameters(compMat)
transFuncts <- estimateLogicle(fs[[1]], channels = chnls)
fs_trans <- transform(fs_comp, transFuncts)

gs <- GatingSet(fs_trans)

gt <- gatingTemplate("path.to.gating.template")
# nodePaths
# used to check whether the node specified in the start/stop.at argument exists
# and to obtain the start position matchInd
names(sapply(gt_get_nodes(gt), openCyto:::alias)) 

[1] "root"
[2] "/boundary"
[3] "/boundary/singlet"
[4] "/boundary/singletRefGate"
[5] "/boundary/singlet/Lymph"
[6] "/boundary/singletRefGate/singletRefGate_LymphRefGate" [7] "/boundary/singlet/Alexa700_gate"
[8] "/boundary/singlet/AmCyan_gate"
[9] "/boundary/singlet/APC_gate"
[10] "/boundary/singlet/PerCPCY5_5_gate"
[11] "/boundary/singlet/Lymph/Alexa700pPerCPCY5_5p"

# gt_nodes
# actual gating proceeds according to this order
RBGL::tsort(gt)[-1]

[1] "/boundary"
[2] "/boundary/singlet"
[3] "/boundary/singlet/PerCPCY5_5_gate"
[4] "/boundary/singlet/APC_gate"
[5] "/boundary/singlet/AmCyan_gate"
[6] "/boundary/singlet/Alexa700_gate"
[7] "/boundary/singlet/Lymph"
[8] "/boundary/singlet/Lymph/Alexa700pPerCPCY5_5p"
[9] "/boundary/singletRefGate"
[10] "/boundary/singletRefGate/singletRefGate_LymphRefGate"

gt_gating(gt, gs, stop.at = "/boundary/singlet/Alexa700_gate")

Gating for 'boundary' done! done. Gating for 'singlet' done! done. Gating for 'PerCPCY5_5_gate' done! done. Gating for 'APC_gate' done! done. Gating for 'AmCyan_gate' done! done. stop at: /boundary/singlet/Alexa700_gate finished.

gt_gating(gt, gs, start = "/boundary/singlet/Alexa700_gate")
# expected to proceed in the following order
# [6] "/boundary/singlet/Alexa700_gate"                     
# [7] "/boundary/singlet/Lymph"                             
# [8] "/boundary/singlet/Lymph/Alexa700pPerCPCY5_5p"        
# [9] "/boundary/singletRefGate"                            
# [10] "/boundary/singletRefGate/singletRefGate_LymphRefGate"

Preprocessing for 'Lymph' Gating for 'Lymph' Using the serial version of flowClust done! done. Population 'Alexa700pPerCPCY5_5p' Error in .cpp_getParent(obj@pointer, sampleNames(obj)[1], y) : Alexa700_gate not found!

I assume the "Alexa700_gate not found!" error occurs because gt_gating stopped just before the 6th node "/boundary/singlet/Alexa700_gate" in gt_nodes due to stop.at = "/boundary/singlet/Alexa700_gate" as expected, and then starts at "/boundary/singlet/Lymph", 7th node in the gt_nodes using the index where nodePaths and start = "/boundary/singlet/Alexa700_gate" matched.

I created a function that fixes the start argument and it seems to be working as expected.

gt_gating_fixstart <- function(gt, gs, start = "root", ...){
    if(start != "root") start <- names(sapply(gt_get_nodes(gt), openCyto:::alias))[match(start, RBGL::tsort(gt)[-1])]
    gt_gating(gt, gs, start = start, ...)
}
gt_gating_fixstart(gt, gs, stop.at = "/boundary/singlet/Alexa700_gate")

Gating for 'boundary' done! done. Gating for 'singlet' done! done. Gating for 'PerCPCY5_5_gate' done! done. Gating for 'APC_gate' done! done. Gating for 'AmCyan_gate' done! done. stop at: /boundary/singlet/Alexa700_gate finished.

gt_gating_fixstart(gt, gs, start = "/boundary/singlet/Alexa700_gate")

Gating for 'Alexa700_gate' done! done. Preprocessing for 'Lymph' Gating for 'Lymph' Using the serial version of flowClust done! done. Population 'Alexa700pPerCPCY5_5p' done! done. Population 'singletRefGate' done! done. Population 'singletRefGate_LymphRefGate' done! done. finished. Warning messages: 1: In grepl(pname, pd$name, ignore.case = T) : argument 'pattern' has length > 1 and only the first element will be used 2: In grepl(pname, pd$name, ignore.case = T) : argument 'pattern' has length > 1 and only the first element will be used

I propose the changes to the source code as follows.

-  nodePaths <- names(sapply(gt_get_nodes(gt),alias))
+  gt_nodes <- tsort(gt)#by the topological order
+  # deprecation of the nodePaths variable and its replacement by the gt_nodes  variable with "root" node 
    #validity check for stop.at argument
  if(!is.null(stop.at)){
    #escape meta character
    stop.pat <- gsub("\\+", "\\\\+", stop.at)
    stop.pat <- paste0(stop.pat, "$")
    if(substr(stop.at, 1,1) != "/")
      stop.pat <- paste0("/", stop.pat) #prepend path delimiter to avoid partial node name matching

-    matchInd <- grep(stop.pat, nodePaths)
+    matchInd <- grep(stop.pat, gt_nodes[-1]) # exclude "root" node from stop points
    if(length(matchInd) == 0)
      stop("Can't find stop point: ", stop.at)
    else if(length(matchInd) > 1)
      stop("ambiguous stop point: ", stop.at)
  }
  # gate each node 
-  gt_nodes <- tsort(gt)[-1]#by the topological order

  #try to skip some nodes to save time
  if(start != "root"){
    #escape meta character
    start.pat <- gsub("\\+", "\\\\+", start)
    start.pat <- paste0(start.pat, "$") #treat it as terminal node
    if(substr(start, 1,1) != "/")
      start.pat <- paste0("/", start.pat) #prepend path delimiter to avoid partial node name matching
-    matchInd <- grep(start.pat, nodePaths)
+    matchInd <- grep(start.pat, gt_nodes)

    if(length(matchInd) == 0)
      stop("Can't find start point: ", start)
    else if(length(matchInd) > 1)
      stop("ambiguous start point: ", start)

    gt_nodes <- gt_nodes[matchInd:length(gt_nodes)]
  }  
mikejiang commented 2 years ago

fixed by https://github.com/RGLab/openCyto/pull/240, however csv based gating is the legacy interface, (including start/stop arguments), I would highly recommend you switch to gs_add_gating_method to incrementally and interactively build/test/trouble shoot each gating step.

htanaka1234 commented 2 years ago

Thanks for the fix, I have confirmed that start argument works fine with openCyto 2.8.3. Thanks for the information as well. I will work on using gs_add_gating_method.