Difference between openCyto.count and xml.count

souckmi commented 3 years ago

Hello,

I encountered an issue, that in some cases, the difference between the count of events gated in FlowJo and the count of events calculated by your packages (checking with gh_pop_compare_stats()) is significantly different (I am aware that you warned, that there can be slight differences due to numeric flaws and I also read the closed issue https://github.com/RGLab/flowWorkspace/issues/256 , but I came across a case when the calculated amount was more than twice greater).

Using the following code (following bioconductor manual):

ws <- open_flowjo_xml(wsfile)
gs <- flowjo_to_gatingset(ws,name = 1)
gh <- gs[[1]]
head(gh_pop_compare_stats(gh))

the openCyto.count is 2253, while xml.count is 1116.

Unfortunately I did not get the permission to give you the data files, but I will try my best with the chunks and hopefully I will soon get some other data resulting in same behavior, that I could share.

I tried to reproduce part of the process - parsed the .wsp file to get the transformations and gating parameters and tried to apply them on compensated data (for compensation i used flowCore compensate()).

I guess there is probably no documentation or information about the meaning of xml attributes in the .wsp but with the help of Gating-ML documentation at least the rectangular and polygon gates seem to be easy to manage.

Confusing part are the transformations. I tried to work with this population described in .wsp like this (in simplified form):

<Population name="..." count="...">
  ...
  <Gate>
      <gating:RectangleGate>
                   <gating:dimension gating:min="-1194.928615027101"  gating:max="26592.458332987124" >
                     <data-type:fcs-dimension data-type:name="Comp-XXX" />
                   </gating:dimension>
                   <gating:dimension gating:min="31316.73410404626"  gating:max="261133.81502890185" >
                     <data-type:fcs-dimension data-type:name="Comp-YYY" />
                   </gating:dimension>
       </gating:RectangleGate>
  </Gate>
</Population>

The transforms nodes containing those fcs-dimensions look like this:

<transforms:linear transforms:minRange="-1622.4268757658"  transforms:maxRange="262144"  gain="1" >
               <data-type:parameter data-type:name="XXX" />
</transforms:linear>

<transforms:linear transforms:minRange="-1622.4268757658"  transforms:maxRange="262144"  gain="1" >
           <data-type:parameter data-type:name="Comp-XXX" />
</transforms:linear>

<transforms:linear transforms:minRange="1"  transforms:maxRange="262144.0000000001"  gain="1" >
           <data-type:parameter data-type:name="YYY" />
</transforms:linear>

<transforms:linear transforms:minRange="1"  transforms:maxRange="262144.0000000001"  gain="1" >
           <data-type:parameter data-type:name="Comp-YYY" />
</transforms:linear>

There are only linear transforms. The strange part is that when I applied gating without transforming data at all in this case I got the same amount of events inside the gate that was in the count attribute of the population tag. Yet when using flowWorkspace the result was twice more.

I got similarly good results with other only linearly transformed dims.

I tried to follow your thoughts in flowWorkspace and cytolib etc. but was not really successful and I understand that the whole concept is much more complex, but I only want to ask if you have any idea what could cause such difference between the counts and if my strange results could help in any way?

I checked the versions of your packages to be up to date according to bioconductor:

R version 4.0.4 (2021-02-15) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale: [1] LC_COLLATE=Czech_Czechia.1250 LC_CTYPE=Czech_Czechia.1250 LC_MONETARY=Czech_Czechia.1250 [4] LC_NUMERIC=C LC_TIME=Czech_Czechia.1250

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] CytoML_2.2.2 flowWorkspace_4.2.0 flowCore_2.2.0

I understand that my issue may be closed if I did not follow the instruction of creating a new issue. Thank you very much

mikejiang commented 3 years ago

Can you paste the results from gh_pop_compare_stats() and the ggcyto::autoplot(gs, pop_name) for the gates in question?

souckmi commented 3 years ago

Sure,

>   gh_pop_compare_stats(gh)  
   openCyto.freq     xml.freq openCyto.count xml.count  node
1:  1.000000e+00 1.000000e+00         630924    630924  root
2:  3.570953e-03 1.768834e-03           2253      1116  1117
3:  6.066024e-02 6.066024e-02          38272     38272 38272
4:  3.340040e-01 3.340040e-01          12783     12783 12783
5:  5.848565e-04 5.832715e-04            369       368   368
6:  6.339908e-06 6.339908e-06              4         4 ctyri
7:  1.584977e-06 1.584977e-06              1         1 jedna

>  ggcyto::autoplot(gs, "1117")

1117ggcyto

mikejiang commented 3 years ago

It is likely that the gate is extended to the far left because the parser sees the gate left bound is negative (i.e. -1194). You can try to disable the extension behavior by relaxing the threshold extend_val = -2000 (default is 0).

mikejiang commented 3 years ago

see https://github.com/RGLab/CytoML/issues/106 for more on the reason for gate extension

PedroMilanezAlmeida commented 3 years ago

Hey @mikejiang

I am getting the opposite problem. That is, I am getting far fewer cells in openCyto.count than in xml.count (second row, "Time Gate", is the issue):

> gh_pop_compare_stats(gs[[4]])
    openCyto.freq     xml.freq openCyto.count xml.count                      node
 1:  1.0000000000 1.0000000000        3044896   3044896                      root
 2:  0.1697601494 0.9802801147         516902   2984851                 Time Gate
 3:  0.8117380084 0.8607103001         419589   2569092                     Cells
 4:  0.9080004481 0.9120366262         380987   2343106        Cells/Single Cells
 5:  0.9641641316 0.9624165531         367334   2255044 Single Cells/Single Cells
 6:  0.7932562736 0.8143734668         291390   1836448                      Live
 7:  0.9962043996 0.9966342635         290284   1830267                     CD45+
 8:  0.2014062091 0.2040975442          58465    373553                CD45+/CD3+
 9:  0.9899598050 0.9899398479          57878    369795                 CD3+/CD3+
10:  0.7949043006 0.7923111765         230748   1450141             CD3negCD14neg
11:  0.9845112417 0.9851552366         227174   1428614                     CD19+
12:  0.0002112918 0.0001637951             48       234                   S1+RBD+

The gate coordinates match well though:

> gh_pop_get_gate(obj = gs[[4]], y = "Time Gate")
Rectangular gate 'Time Gate' with dimensions:
  FSC-A: (0,4441999.11811024)
  Time: (-407.642745098038,50.7552941176469)

When I plot the "Time Gate" with ggcyto vs FlowJo, the values in ggcyto look much more spread out than in FlowJo:

Interestingly, when I look at the range of values for Time, they look similar to FlowJo though:

> gh_pop_get_data(gs[[4]], "Time Gate") %>% exprs() %>% .[,"Time"] %>% range
[1]  0.0000 50.7552

Do you know what could possibly be wrong?

Any help would be much appreciated.

PedroMilanezAlmeida commented 3 years ago

Session info for comment above.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
 [1] ggcyto_1.18.0               ncdfFlow_2.36.0            
 [3] BH_1.75.0-0                 RcppArmadillo_0.10.4.0.0   
 [5] leiden_0.3.7                SeuratObject_4.0.0         
 [7] Seurat_4.0.1                uwot_0.1.10                
 [9] Matrix_1.2-18               forcats_0.5.1              
[11] stringr_1.4.0               dplyr_1.0.5                
[13] purrr_0.3.4                 readr_1.4.0                
[15] tidyr_1.1.3                 tibble_3.1.1               
[17] tidyverse_1.3.1             flowWorkspace_4.3.7        
[19] CytoML_2.3.3                scater_1.18.6              
[21] ggplot2_3.3.3               diffcyt_1.10.0             
[23] flowCore_2.3.2              cowplot_1.1.1              
[25] CATALYST_1.14.0             SingleCellExperiment_1.12.0
[27] SummarizedExperiment_1.20.0 Biobase_2.50.0             
[29] GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
[31] IRanges_2.24.1              S4Vectors_0.28.1           
[33] BiocGenerics_0.36.1         MatrixGenerics_1.2.1       
[35] matrixStats_0.58.0         

loaded via a namespace (and not attached):
  [1] rsvd_1.0.5                  ica_1.0-2                  
  [3] svglite_2.0.0               lmtest_0.9-38              
  [5] crayon_1.4.1                spatstat.core_2.1-2        
  [7] MASS_7.3-53.1               backports_1.2.1            
  [9] nlme_3.1-152                reprex_2.0.0               
 [11] rlang_0.4.10                XVector_0.30.0             
 [13] ROCR_1.0-11                 readxl_1.3.1               
 [15] irlba_2.3.3                 nloptr_1.2.2.2             
 [17] limma_3.46.0                BiocParallel_1.24.1        
 [19] rjson_0.2.20                glue_1.4.2                 
 [21] pheatmap_1.0.12             sctransform_0.3.2          
 [23] vipor_0.4.5                 spatstat.sparse_2.0-0      
 [25] spatstat.geom_2.1-0         haven_2.4.0                
 [27] tidyselect_1.1.0            rio_0.5.26                 
 [29] fitdistrplus_1.1-3          XML_3.99-0.6               
 [31] zoo_1.8-9                   nnls_1.4                   
 [33] xtable_1.8-4                magrittr_2.0.1             
 [35] evaluate_0.14               scuttle_1.0.4              
 [37] cli_2.4.0                   zlibbioc_1.36.0            
 [39] rstudioapi_0.13             miniUI_0.1.1.1             
 [41] rpart_4.1-15                shiny_1.6.0                
 [43] BiocSingular_1.6.0          xfun_0.22                  
 [45] clue_0.3-59                 cluster_2.1.0              
 [47] ggrepel_0.9.1               listenv_0.8.0              
 [49] png_0.1-7                   future_1.21.0              
 [51] withr_2.4.2                 bitops_1.0-6               
 [53] aws.signature_0.6.0         RBGL_1.66.0                
 [55] plyr_1.8.6                  cellranger_1.1.0           
 [57] pillar_1.6.0                RcppParallel_5.1.2         
 [59] GlobalOptions_0.1.2         multcomp_1.4-16            
 [61] fs_1.5.0                    GetoptLong_1.0.5           
 [63] colortable_0.3.0            DelayedMatrixStats_1.12.3  
 [65] vctrs_0.3.7                 ellipsis_0.3.1             
 [67] generics_0.1.0              tools_4.0.2                
 [69] foreign_0.8-80              beeswarm_0.3.1             
 [71] munsell_0.5.0               aws.s3_0.3.21              
 [73] DelayedArray_0.16.3         fastmap_1.1.0              
 [75] compiler_4.0.2              abind_1.4-5                
 [77] httpuv_1.5.5                plotly_4.9.3               
 [79] GenomeInfoDbData_1.2.4      gridExtra_2.3              
 [81] edgeR_3.32.1                lattice_0.20-41            
 [83] deldir_0.2-10               utf8_1.2.1                 
 [85] later_1.1.0.1               jsonlite_1.7.2             
 [87] scales_1.1.1                graph_1.68.0               
 [89] pbapply_1.4-3               carData_3.0-4              
 [91] sparseMatrixStats_1.2.1     lazyeval_0.2.2             
 [93] promises_1.2.0.1            car_3.0-10                 
 [95] latticeExtra_0.6-29         goftest_1.2-2              
 [97] spatstat.utils_2.1-0        reticulate_1.18-9008       
 [99] rmarkdown_2.7               openxlsx_4.2.3             
[101] sandwich_3.0-0              statmod_1.4.35             
[103] webshot_0.5.2               Rtsne_0.15                 
[105] igraph_1.2.6                survival_3.2-10            
[107] yaml_2.2.1                  plotrix_3.8-1              
[109] systemfonts_1.0.1           cytolib_2.3.8              
[111] htmltools_0.5.1.1           locfit_1.5-9.4             
[113] viridisLite_0.4.0           digest_0.6.27              
[115] assertthat_0.2.1            mime_0.10                  
[117] future.apply_1.7.0          data.table_1.14.0          
[119] drc_3.0-1                   labeling_0.4.2             
[121] splines_4.0.2               Cairo_1.5-12.2             
[123] RCurl_1.98-1.3              broom_0.7.6                
[125] hms_1.0.0                   modelr_0.1.8               
[127] colorspace_2.0-0            ConsensusClusterPlus_1.54.0
[129] base64enc_0.1-3             ggbeeswarm_0.6.0           
[131] shape_1.4.5                 Rcpp_1.0.6                 
[133] RANN_2.6.1                  mvtnorm_1.1-1              
[135] circlize_0.4.12             FlowSOM_1.22.0             
[137] RProtoBufLib_2.3.4          fansi_0.4.2                
[139] parallelly_1.24.0           R6_2.5.0                   
[141] grid_4.0.2                  ggridges_0.5.3             
[143] lifecycle_1.0.0             zip_2.1.1                  
[145] curl_4.3                    minqa_1.2.4                
[147] RcppAnnoy_0.0.18            TH.data_1.0-10             
[149] RColorBrewer_1.1-2          htmlwidgets_1.5.3          
[151] beachmat_2.6.4              polyclip_1.10-0            
[153] rvest_1.0.0                 ComplexHeatmap_2.6.2       
[155] mgcv_1.8-35                 globals_0.14.0             
[157] patchwork_1.1.1             lubridate_1.7.10           
[159] codetools_0.2-16            gtools_3.8.2               
[161] cytoqc_0.99.2               dbplyr_2.1.1               
[163] gtable_0.3.0                tsne_0.1-3                 
[165] DBI_1.1.1                   tensor_1.5                 
[167] httr_1.4.2                  KernSmooth_2.23-17         
[169] stringi_1.5.3               farver_2.1.0               
[171] reshape2_1.4.4              viridis_0.6.0              
[173] hexbin_1.28.2               Rgraphviz_2.34.0           
[175] DT_0.18                     xml2_1.3.2                 
[177] boot_1.3-25                 BiocNeighbors_1.8.2        
[179] kableExtra_1.3.4            lme4_1.1-26                
[181] scattermore_0.7             jpeg_0.1-8.1               
[183] spatstat.data_2.1-0         pkgconfig_2.0.3            
[185] knitr_1.32                  egg_0.4.5

mikejiang commented 3 years ago

Can you provide example workspace and fcs for troubleshooting?

PedroMilanezAlmeida commented 3 years ago

Thank you so much for the quick reply, @mikejiang! I appreciate your help!

The files are here: https://drive.google.com/drive/folders/1GR5QtFTmjRS3X4VL1LK5bXMhtMi1t8Q6

mikejiang commented 3 years ago

@PedroMilanezAlmeida , sorry for the delayed response. It was time channel scaling problem. Original FCS data stores time channel in units. i.e.

 range(fr, "data")[["Time"]]
[1]       0 4171631

and flowJo defines the time gate in the real time seconds.

<gating:dimension gating:min="-407.64274509803823"  gating:max="50.75529411764694" >
                   <data-type:fcs-dimension data-type:name="Time" />

So data needs to be scaled properly, and usually it is done by multiply the scale factor defined in FCS keyword

<Keyword name="$TIMESTEP"  value="0.0001" />

Apparently this value is not accurate for this case. There are another source we can compute the timestep from

<Keyword name="$BTIM"  value="17:03:22.06" />
<Keyword name="$ETIM"  value="17:04:15.89" />

they happen to be correct in this case, but aren't always reliable based on other use cases in the past. So I choose to use the information from the third source

<transforms:linear transforms:minRange="0"  transforms:maxRange="53"  gain="0.0000127049" >
           <data-type:parameter data-type:name="Time" />

which isn't always present, but whenever it is available in xml, we will use that instead.

Now with the latest patches in cytolib and CytoML, the gate should be fixed.

PedroMilanezAlmeida commented 3 years ago

Thank you very much, @mikejiang! I will try the latest patches and get back to you if I still have any issues.

Kelly-A-W commented 2 years ago

Hello @mikejiang, as mentioned you have said that it is possible that the counts calculated in FlowJo will differ to those calculated using your package because of numerical errors. I wonder if you could possibly clarify if these numerical errors are from your package or from FlowJo (or from both) and which counts are more accurate ? And if there is known to be any bias, such as whether the counts calculated are generally under- or over-estimated ? I have tried to find information online but have been unsuccessful so far.

Some context: I am a masters in statistics student working on a research paper with an immunological institute. To speed up the gating and data wrangling process, they have asked me to try generate count data in R using the gh_pop_get_indices() from your package to count the number of cells expressing a certain combination of cytokines, for every possible combination of cytokines. My supervisors are trying to assess from an immunological perspective whether the differences in the counts I produced in R compared to the FlowJo counts are significant or not and whether they can use the R counts instead. To guide us in our decision making, it would be very helpful to know if one is more accurate than the other and why.

gfinak commented 2 years ago

Both should be reliable and fit for purpose. When there are differences the question is where they lead to significant differences in downstream inference. We've observed numerical differences for three primary reasons in the past and most of these have been resolved I believe.

Gates go through regions of high cell density and differences are due to cells on the boundary. This may no longer be an issue. Mike can provide more insights.
Gates go through regions where data transition between linear and logarithmic scale (the estimation of these transforms from FlowJo workspaces can be subject to numerical effects).
Gates were updated in flowJo but not "recomputed" (v9 of flowJo had such a feature), and data was not exported correctly. In those cases it is the flowJo exports that are erroneous.

OpenCyto has been used for years to support vaccine clinical trials within the Fred Hutch and has been validated there.

Best thing you can do is test it out and compare then track down the source of discrepancies to understand they're discrepancies and assess whether these lead to differences in downstream inference.

mikejiang commented 2 years ago

Gates go through regions of high cell density and differences are due to cells on the boundary. This may no longer be an issue.

yes, it was resolved by https://github.com/RGLab/flowCore/issues/86

Kelly-A-W commented 2 years ago

Thank you so much @gfinak for the detailed response and @mikejiang for the confirmation ! The information was very helpful and I suspect that our discrepancies are due to your second point.

dpschreiner commented 1 year ago

Is there a common explanation for the situation where the openCyto.counts are all 0 (except for root) while xml.count has values? Thanks in advance!

mikejiang commented 1 year ago

It is likely gates are not parsed/transformed properly, which can be caused by different reasons, has to diagnose case by case by providing reproducible wsp and fcs files

RGLab / flowWorkspace

Difference between openCyto.count and xml.count #356