Open souckmi opened 3 years ago
Can you paste the results from gh_pop_compare_stats()
and the ggcyto::autoplot(gs, pop_name)
for the gates in question?
Sure,
> gh_pop_compare_stats(gh)
openCyto.freq xml.freq openCyto.count xml.count node
1: 1.000000e+00 1.000000e+00 630924 630924 root
2: 3.570953e-03 1.768834e-03 2253 1116 1117
3: 6.066024e-02 6.066024e-02 38272 38272 38272
4: 3.340040e-01 3.340040e-01 12783 12783 12783
5: 5.848565e-04 5.832715e-04 369 368 368
6: 6.339908e-06 6.339908e-06 4 4 ctyri
7: 1.584977e-06 1.584977e-06 1 1 jedna
> ggcyto::autoplot(gs, "1117")
It is likely that the gate is extended to the far left because the parser sees the gate left bound is negative (i.e. -1194).
You can try to disable the extension behavior by relaxing the threshold extend_val = -2000
(default is 0).
see https://github.com/RGLab/CytoML/issues/106 for more on the reason for gate extension
Hey @mikejiang
I am getting the opposite problem. That is, I am getting far fewer cells in openCyto.count than in xml.count (second row, "Time Gate"
, is the issue):
> gh_pop_compare_stats(gs[[4]])
openCyto.freq xml.freq openCyto.count xml.count node
1: 1.0000000000 1.0000000000 3044896 3044896 root
2: 0.1697601494 0.9802801147 516902 2984851 Time Gate
3: 0.8117380084 0.8607103001 419589 2569092 Cells
4: 0.9080004481 0.9120366262 380987 2343106 Cells/Single Cells
5: 0.9641641316 0.9624165531 367334 2255044 Single Cells/Single Cells
6: 0.7932562736 0.8143734668 291390 1836448 Live
7: 0.9962043996 0.9966342635 290284 1830267 CD45+
8: 0.2014062091 0.2040975442 58465 373553 CD45+/CD3+
9: 0.9899598050 0.9899398479 57878 369795 CD3+/CD3+
10: 0.7949043006 0.7923111765 230748 1450141 CD3negCD14neg
11: 0.9845112417 0.9851552366 227174 1428614 CD19+
12: 0.0002112918 0.0001637951 48 234 S1+RBD+
The gate coordinates match well though:
> gh_pop_get_gate(obj = gs[[4]], y = "Time Gate")
Rectangular gate 'Time Gate' with dimensions:
FSC-A: (0,4441999.11811024)
Time: (-407.642745098038,50.7552941176469)
When I plot the "Time Gate" with ggcyto vs FlowJo, the values in ggcyto look much more spread out than in FlowJo:
Interestingly, when I look at the range of values for Time, they look similar to FlowJo though:
> gh_pop_get_data(gs[[4]], "Time Gate") %>% exprs() %>% .[,"Time"] %>% range
[1] 0.0000 50.7552
Do you know what could possibly be wrong?
Any help would be much appreciated.
Session info for comment above.
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] ggcyto_1.18.0 ncdfFlow_2.36.0
[3] BH_1.75.0-0 RcppArmadillo_0.10.4.0.0
[5] leiden_0.3.7 SeuratObject_4.0.0
[7] Seurat_4.0.1 uwot_0.1.10
[9] Matrix_1.2-18 forcats_0.5.1
[11] stringr_1.4.0 dplyr_1.0.5
[13] purrr_0.3.4 readr_1.4.0
[15] tidyr_1.1.3 tibble_3.1.1
[17] tidyverse_1.3.1 flowWorkspace_4.3.7
[19] CytoML_2.3.3 scater_1.18.6
[21] ggplot2_3.3.3 diffcyt_1.10.0
[23] flowCore_2.3.2 cowplot_1.1.1
[25] CATALYST_1.14.0 SingleCellExperiment_1.12.0
[27] SummarizedExperiment_1.20.0 Biobase_2.50.0
[29] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7
[31] IRanges_2.24.1 S4Vectors_0.28.1
[33] BiocGenerics_0.36.1 MatrixGenerics_1.2.1
[35] matrixStats_0.58.0
loaded via a namespace (and not attached):
[1] rsvd_1.0.5 ica_1.0-2
[3] svglite_2.0.0 lmtest_0.9-38
[5] crayon_1.4.1 spatstat.core_2.1-2
[7] MASS_7.3-53.1 backports_1.2.1
[9] nlme_3.1-152 reprex_2.0.0
[11] rlang_0.4.10 XVector_0.30.0
[13] ROCR_1.0-11 readxl_1.3.1
[15] irlba_2.3.3 nloptr_1.2.2.2
[17] limma_3.46.0 BiocParallel_1.24.1
[19] rjson_0.2.20 glue_1.4.2
[21] pheatmap_1.0.12 sctransform_0.3.2
[23] vipor_0.4.5 spatstat.sparse_2.0-0
[25] spatstat.geom_2.1-0 haven_2.4.0
[27] tidyselect_1.1.0 rio_0.5.26
[29] fitdistrplus_1.1-3 XML_3.99-0.6
[31] zoo_1.8-9 nnls_1.4
[33] xtable_1.8-4 magrittr_2.0.1
[35] evaluate_0.14 scuttle_1.0.4
[37] cli_2.4.0 zlibbioc_1.36.0
[39] rstudioapi_0.13 miniUI_0.1.1.1
[41] rpart_4.1-15 shiny_1.6.0
[43] BiocSingular_1.6.0 xfun_0.22
[45] clue_0.3-59 cluster_2.1.0
[47] ggrepel_0.9.1 listenv_0.8.0
[49] png_0.1-7 future_1.21.0
[51] withr_2.4.2 bitops_1.0-6
[53] aws.signature_0.6.0 RBGL_1.66.0
[55] plyr_1.8.6 cellranger_1.1.0
[57] pillar_1.6.0 RcppParallel_5.1.2
[59] GlobalOptions_0.1.2 multcomp_1.4-16
[61] fs_1.5.0 GetoptLong_1.0.5
[63] colortable_0.3.0 DelayedMatrixStats_1.12.3
[65] vctrs_0.3.7 ellipsis_0.3.1
[67] generics_0.1.0 tools_4.0.2
[69] foreign_0.8-80 beeswarm_0.3.1
[71] munsell_0.5.0 aws.s3_0.3.21
[73] DelayedArray_0.16.3 fastmap_1.1.0
[75] compiler_4.0.2 abind_1.4-5
[77] httpuv_1.5.5 plotly_4.9.3
[79] GenomeInfoDbData_1.2.4 gridExtra_2.3
[81] edgeR_3.32.1 lattice_0.20-41
[83] deldir_0.2-10 utf8_1.2.1
[85] later_1.1.0.1 jsonlite_1.7.2
[87] scales_1.1.1 graph_1.68.0
[89] pbapply_1.4-3 carData_3.0-4
[91] sparseMatrixStats_1.2.1 lazyeval_0.2.2
[93] promises_1.2.0.1 car_3.0-10
[95] latticeExtra_0.6-29 goftest_1.2-2
[97] spatstat.utils_2.1-0 reticulate_1.18-9008
[99] rmarkdown_2.7 openxlsx_4.2.3
[101] sandwich_3.0-0 statmod_1.4.35
[103] webshot_0.5.2 Rtsne_0.15
[105] igraph_1.2.6 survival_3.2-10
[107] yaml_2.2.1 plotrix_3.8-1
[109] systemfonts_1.0.1 cytolib_2.3.8
[111] htmltools_0.5.1.1 locfit_1.5-9.4
[113] viridisLite_0.4.0 digest_0.6.27
[115] assertthat_0.2.1 mime_0.10
[117] future.apply_1.7.0 data.table_1.14.0
[119] drc_3.0-1 labeling_0.4.2
[121] splines_4.0.2 Cairo_1.5-12.2
[123] RCurl_1.98-1.3 broom_0.7.6
[125] hms_1.0.0 modelr_0.1.8
[127] colorspace_2.0-0 ConsensusClusterPlus_1.54.0
[129] base64enc_0.1-3 ggbeeswarm_0.6.0
[131] shape_1.4.5 Rcpp_1.0.6
[133] RANN_2.6.1 mvtnorm_1.1-1
[135] circlize_0.4.12 FlowSOM_1.22.0
[137] RProtoBufLib_2.3.4 fansi_0.4.2
[139] parallelly_1.24.0 R6_2.5.0
[141] grid_4.0.2 ggridges_0.5.3
[143] lifecycle_1.0.0 zip_2.1.1
[145] curl_4.3 minqa_1.2.4
[147] RcppAnnoy_0.0.18 TH.data_1.0-10
[149] RColorBrewer_1.1-2 htmlwidgets_1.5.3
[151] beachmat_2.6.4 polyclip_1.10-0
[153] rvest_1.0.0 ComplexHeatmap_2.6.2
[155] mgcv_1.8-35 globals_0.14.0
[157] patchwork_1.1.1 lubridate_1.7.10
[159] codetools_0.2-16 gtools_3.8.2
[161] cytoqc_0.99.2 dbplyr_2.1.1
[163] gtable_0.3.0 tsne_0.1-3
[165] DBI_1.1.1 tensor_1.5
[167] httr_1.4.2 KernSmooth_2.23-17
[169] stringi_1.5.3 farver_2.1.0
[171] reshape2_1.4.4 viridis_0.6.0
[173] hexbin_1.28.2 Rgraphviz_2.34.0
[175] DT_0.18 xml2_1.3.2
[177] boot_1.3-25 BiocNeighbors_1.8.2
[179] kableExtra_1.3.4 lme4_1.1-26
[181] scattermore_0.7 jpeg_0.1-8.1
[183] spatstat.data_2.1-0 pkgconfig_2.0.3
[185] knitr_1.32 egg_0.4.5
Can you provide example workspace and fcs for troubleshooting?
Thank you so much for the quick reply, @mikejiang! I appreciate your help!
The files are here: https://drive.google.com/drive/folders/1GR5QtFTmjRS3X4VL1LK5bXMhtMi1t8Q6
@PedroMilanezAlmeida , sorry for the delayed response. It was time channel scaling problem. Original FCS data stores time channel in units. i.e.
range(fr, "data")[["Time"]]
[1] 0 4171631
and flowJo defines the time gate in the real time seconds.
<gating:dimension gating:min="-407.64274509803823" gating:max="50.75529411764694" >
<data-type:fcs-dimension data-type:name="Time" />
So data needs to be scaled properly, and usually it is done by multiply the scale factor defined in FCS keyword
<Keyword name="$TIMESTEP" value="0.0001" />
Apparently this value is not accurate for this case. There are another source we can compute the timestep from
<Keyword name="$BTIM" value="17:03:22.06" />
<Keyword name="$ETIM" value="17:04:15.89" />
they happen to be correct in this case, but aren't always reliable based on other use cases in the past. So I choose to use the information from the third source
<transforms:linear transforms:minRange="0" transforms:maxRange="53" gain="0.0000127049" >
<data-type:parameter data-type:name="Time" />
which isn't always present, but whenever it is available in xml, we will use that instead.
Now with the latest patches in cytolib and CytoML, the gate should be fixed.
Thank you very much, @mikejiang! I will try the latest patches and get back to you if I still have any issues.
Hello @mikejiang, as mentioned you have said that it is possible that the counts calculated in FlowJo will differ to those calculated using your package because of numerical errors. I wonder if you could possibly clarify if these numerical errors are from your package or from FlowJo (or from both) and which counts are more accurate ? And if there is known to be any bias, such as whether the counts calculated are generally under- or over-estimated ? I have tried to find information online but have been unsuccessful so far.
Some context: I am a masters in statistics student working on a research paper with an immunological institute. To speed up the gating and data wrangling process, they have asked me to try generate count data in R using the gh_pop_get_indices() from your package to count the number of cells expressing a certain combination of cytokines, for every possible combination of cytokines. My supervisors are trying to assess from an immunological perspective whether the differences in the counts I produced in R compared to the FlowJo counts are significant or not and whether they can use the R counts instead. To guide us in our decision making, it would be very helpful to know if one is more accurate than the other and why.
Both should be reliable and fit for purpose. When there are differences the question is where they lead to significant differences in downstream inference. We've observed numerical differences for three primary reasons in the past and most of these have been resolved I believe.
OpenCyto has been used for years to support vaccine clinical trials within the Fred Hutch and has been validated there.
Best thing you can do is test it out and compare then track down the source of discrepancies to understand they're discrepancies and assess whether these lead to differences in downstream inference.
- Gates go through regions of high cell density and differences are due to cells on the boundary. This may no longer be an issue.
yes, it was resolved by https://github.com/RGLab/flowCore/issues/86
Thank you so much @gfinak for the detailed response and @mikejiang for the confirmation ! The information was very helpful and I suspect that our discrepancies are due to your second point.
Is there a common explanation for the situation where the openCyto.counts are all 0 (except for root) while xml.count has values? Thanks in advance!
It is likely gates are not parsed/transformed properly, which can be caused by different reasons, has to diagnose case by case by providing reproducible wsp and fcs files
Hello,
I encountered an issue, that in some cases, the difference between the count of events gated in FlowJo and the count of events calculated by your packages (checking with
gh_pop_compare_stats()
) is significantly different (I am aware that you warned, that there can be slight differences due to numeric flaws and I also read the closed issue https://github.com/RGLab/flowWorkspace/issues/256 , but I came across a case when the calculated amount was more than twice greater).Using the following code (following bioconductor manual):
the
openCyto.count
is 2253, whilexml.count
is 1116.Unfortunately I did not get the permission to give you the data files, but I will try my best with the chunks and hopefully I will soon get some other data resulting in same behavior, that I could share.
I tried to reproduce part of the process - parsed the .wsp file to get the transformations and gating parameters and tried to apply them on compensated data (for compensation i used flowCore
compensate()
).I guess there is probably no documentation or information about the meaning of xml attributes in the .wsp but with the help of Gating-ML documentation at least the rectangular and polygon gates seem to be easy to manage.
Confusing part are the transformations. I tried to work with this population described in .wsp like this (in simplified form):
The transforms nodes containing those fcs-dimensions look like this:
There are only linear transforms. The strange part is that when I applied gating without transforming data at all in this case I got the same amount of events inside the gate that was in the count attribute of the population tag. Yet when using flowWorkspace the result was twice more.
I got similarly good results with other only linearly transformed dims.
I tried to follow your thoughts in flowWorkspace and cytolib etc. but was not really successful and I understand that the whole concept is much more complex, but I only want to ask if you have any idea what could cause such difference between the counts and if my strange results could help in any way?
I checked the versions of your packages to be up to date according to bioconductor:
I understand that my issue may be closed if I did not follow the instruction of creating a new issue. Thank you very much