RGLab / flowCore

Core flow cytometry infrastructure
43 stars 25 forks source link

Still misleaded by range #187

Closed SamGG closed 3 years ago

SamGG commented 4 years ago

Describe the bug The ranges written on disk are integers, truncated to the lowest values, but data are not truncatedwhen read back. I don't understand at all. If I am wrong (because tired), let me know. On the contrary, can we solve this for good, and discuss the corresponding tests? Best, Samuel

To Reproduce Steps to reproduce the behavior: please use the reprex package to build a reproducible example.

library(flowCore)
(gg = flowFrame(sapply(iris, as.numeric)))
#> flowFrame object 'anonymous'
#> with 150 cells and 5 observables:
#>             name         desc range minRange maxRange
#> $P1 Sepal.Length Sepal.Length   7.9      4.3      7.9
#> $P2  Sepal.Width  Sepal.Width   4.4      2.0      4.4
#> $P3 Petal.Length Petal.Length   6.9      1.0      6.9
#> $P4  Petal.Width  Petal.Width   2.5      0.1      2.5
#> $P5      Species      Species   3.0      1.0      3.0
#> 0 keywords are stored in the 'description' slot
summary(gg)
#>         Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Min.        4.300000    2.000000        1.000    0.100000       1
#> 1st Qu.     5.100000    2.800000        1.600    0.300000       1
#> Median      5.800000    3.000000        4.350    1.300000       2
#> Mean        5.843333    3.057333        3.758    1.199333       2
#> 3rd Qu.     6.400000    3.300000        5.100    1.800000       3
#> Max.        7.900000    4.400000        6.900    2.500000       3
table(exprs(gg[,5]))
#> 
#>  1  2  3 
#> 50 50 50
write.FCS(gg, "gg.fcs")
#> [1] "gg.fcs"
(ggg = read.FCS("gg.fcs"))
#> flowFrame object 'gg.fcs'
#> with 150 cells and 5 observables:
#>             name         desc range minRange maxRange
#> $P1 Sepal.Length Sepal.Length     8        0        7
#> $P2  Sepal.Width  Sepal.Width     5        0        4
#> $P3 Petal.Length Petal.Length     7        0        6
#> $P4  Petal.Width  Petal.Width     3        0        2
#> $P5      Species      Species     3        0        2
#> 51 keywords are stored in the 'description' slot
summary(ggg)
#>         Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Min.        4.300000    2.000000        1.000    0.100000       1
#> 1st Qu.     5.100000    2.800000        1.600    0.300000       1
#> Median      5.800000    3.000000        4.350    1.300000       2
#> Mean        5.843333    3.057333        3.758    1.199333       2
#> 3rd Qu.     6.400000    3.300000        5.100    1.800000       3
#> Max.        7.900000    4.400000        6.900    2.500000       3
table(exprs(ggg[,5]))
#> 
#>  1  2  3 
#> 50 50 50
#keyword(ggg)
(ggg = read.FCS("gg.fcs", truncate_max_range = FALSE, transformation = FALSE))
#> flowFrame object 'gg.fcs'
#> with 150 cells and 5 observables:
#>             name         desc range minRange maxRange
#> $P1 Sepal.Length Sepal.Length     8        0        7
#> $P2  Sepal.Width  Sepal.Width     5        0        4
#> $P3 Petal.Length Petal.Length     7        0        6
#> $P4  Petal.Width  Petal.Width     3        0        2
#> $P5      Species      Species     3        0        2
#> 40 keywords are stored in the 'description' slot
summary(ggg)
#>         Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Min.        4.300000    2.000000        1.000    0.100000       1
#> 1st Qu.     5.100000    2.800000        1.600    0.300000       1
#> Median      5.800000    3.000000        4.350    1.300000       2
#> Mean        5.843333    3.057333        3.758    1.199333       2
#> 3rd Qu.     6.400000    3.300000        5.100    1.800000       3
#> Max.        7.900000    4.400000        6.900    2.500000       3
table(exprs(ggg[,5]))
#> 
#>  1  2  3 
#> 50 50 50
#keyword(ggg)
sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
#> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=French_France.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] flowCore_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6        matrixStats_0.56.0  RProtoBufLib_2.0.0 
#>  [4] digest_0.6.25       stats4_4.0.0        magrittr_1.5       
#>  [7] evaluate_0.14       RcppParallel_5.0.0  highr_0.8          
#> [10] rlang_0.4.5         stringi_1.4.6       rmarkdown_2.1      
#> [13] tools_4.0.0         stringr_1.4.0       Biobase_2.48.0     
#> [16] parallel_4.0.0      xfun_0.13           yaml_2.2.1         
#> [19] compiler_4.0.0      BiocGenerics_0.34.0 cytolib_2.0.0      
#> [22] htmltools_0.4.0     knitr_1.28

Created on 2020-05-07 by the reprex package (v0.3.0)

mikejiang commented 4 years ago

First of all, this manually crafted (out of a matrix) flowFrame doesn't carry any keywords (especially $PnX ), so write.FCS is trying to impute these values, including $PnR. And its value needs to be Integer (at least it is the case when file is generated from the instruments). And write.FCS computes the maximum data value and round it to ceiling, which you can see through keywords after you load it back

write.FCS(gg, tmp)

fr <- read.FCS(tmp)
> range(fr)
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
min            0           0            0           0       0
max            7           4            6           2       2
> range(fr, "data")
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
min          4.3         2.0          1.0         0.1       1
max          7.9         4.4          6.9         2.5       3
> unlist(keyword(fr)[paste0("$P", 1:5, "R")])
$P1R $P2R $P3R $P4R $P5R 
 "8"  "5"  "7"  "3"  "3

So, PnR isn't actually truncated. But maxRange value(returned by range or summary) is computed from PnR and it always gets subtracted by 1, ( It is the legacy behavior and has been these way for a long time, and I don't know the exact reason, maybe has to do with the PnR definition of FCS standard?).

We can bump PnR value by 1 during write.FCS so that range(fr) and range(fr, "data") return the consistent range (without appearing to be truncated) if it bothers you.

Again this doesn't affect the FCS files that come from instruments or other softwares.

mikejiang commented 4 years ago

After the patch, here is the range info

> write.FCS(gg, tmp)
[1] "/tmp/RtmpKbY7ug/filec4335b7cea1"
> fr <- read.FCS(tmp)
> range(fr)
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
min            0           0            0           0       0
max            8           5            7           3       3
> range(fr, "data")
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
min          4.3         2.0          1.0         0.1       1
max          7.9         4.4          6.9         2.5       3
> unlist(keyword(fr)[paste0("$P", 1:5, "R")])
$P1R $P2R $P3R $P4R $P5R 
 "9"  "6"  "8"  "4"  "4" 
gfinak commented 4 years ago

Thanks, Mike

SamGG commented 4 years ago

Thanks Mike for fixing. In fact, I noticed this while explaining how to concatenate FCS. So, the faulty FCS comes out instruments in a sense. I should have given you a better example. It is below. I already seen some real problems linked to this behaviour one year ago. If the Original holds sample or cluster identifiers, truncation is a big problem. The last sample or cluster will be merged with the previous one. While reading FCS files is the most important point, every analytical pipeline should export its results, and it turns to be FCS files frequently, in order to allow experimenters to verify and/or tune the pipeline. So this is very important IMO. And it's even more important because most of flow cytometry packages tend to have their own/customized FCS writing functions that overloads the write.FCS function.

To recap the example below: a) warnings concerning GvHD b) NA range after as(), although summary is OK c) keywords are not read during concatenation (just to mention it) d) truncation is not taking place as I expect, but maybe it's a border effect: if maxRange is 34, I imagine that 35 should be truncated, but maybe it's 36 or 35.00001 that will be the first truncated values to 35 (?). Definitively those dual variables range and maxRange are a nightmare.

I think your fix is right, and it is important. I will keep in mind to be careful when flowFrame has no range keyword, as from matrix or concatenation. Check if everthing is correct in the new example, your fix has been not used yet. Thanks again.

library(flowCore)
data("GvHD")
#> Warning in load(zfile, envir = tmp_env): strings not representable in native
#> encoding will be translated to UTF-8
#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?

#> Warning in load(zfile, envir = tmp_env): input string 'CELLQuestª' cannot be
#> translated to UTF-8, is it valid in 'UTF-8' ?
warnings()[1:5]
#> NULL
(gg = as(GvHD, "flowFrame"))
#> flowFrame object 'anonymous'
#> with 585644 cells and 9 observables:
#>              name              desc range minRange maxRange
#> $P1         FSC-H        FSC-Height  1024        0     1023
#> $P2         SSC-H        SSC-Height  1024        0     1023
#> $P3         FL1-H         CD15 FITC  1024        1    10000
#> $P4         FL2-H           CD45 PE  1024        1    10000
#> $P5         FL3-H        CD14 PerCP  1024        1    10000
#> $P6         FL2-A              <NA>  1024        0     1023
#> $P7         FL4-H          CD33 APC  1024        1    10000
#> $P8          Time Time (51.20 sec.)  1024        0     1023
#> Original Original    Original Frame    NA        1       35
#> 2 keywords are stored in the 'description' slot
summary(gg)
#>             FSC-H     SSC-H        FL1-H      FL2-H        FL3-H     FL2-A
#> Min.      52.0000    0.0000     1.000000     1.0000     1.000000    0.0000
#> 1st Qu.   88.0000   67.0000     2.111231   227.9141     1.626087   56.0000
#> Median   130.0000  101.0000     6.744263   447.7405     4.222901  112.0000
#> Mean     166.5125  129.6283   226.508527   567.4725    26.534154  139.5429
#> 3rd Qu.  214.0000  154.0000    50.674542   768.4763    11.680157  185.0000
#> Max.    1023.0000 1023.0000 10000.000000 10000.0000 10000.000000 1023.0000
#>                FL4-H     Time Original
#> Min.        1.000000   0.0000  1.00000
#> 1st Qu.     1.878040  71.0000 15.00000
#> Median      5.010744 191.0000 20.00000
#> Mean       21.662928 260.2665 21.04099
#> 3rd Qu.    26.740836 375.0000 27.00000
#> Max.    10000.000000 999.0000 35.00000
table(exprs(gg[,9]))
#> 
#>     1     2     3     4     5     6     7     8     9    10    11    12    13 
#>  3420  3405  3435  8550 10410  3750 13979  2205 13350 11610 13830 13934 19470 
#>    14    15    16    17    18    19    20    21    22    23    24    25    26 
#>  9405 16055 10305 30195 59849 39068 11939 13214 12666  8265  4065  7080 66105 
#>    27    28    29    30    31    32    33    34    35 
#> 49421 12768 17289  4530  6765  9540 25515 24656 25601
keyword(gg)
#> $description
#> [1] "Synthetic Frame"
#> 
#> $sampleNames
#>  [1] "s5a01"  "s5a02"  "s5a03"  "s5a04"  "s5a05"  "s5a06"  "s5a07"  "s6a01" 
#>  [9] "s6a02"  "s6a03"  "s6a04"  "s6a05"  "s6a06"  "s6a07"  "s7a01"  "s7a02" 
#> [17] "s7a03"  "s7a04"  "s7a05"  "s7a06"  "s7a07"  "s9a01"  "s9a02"  "s9a03" 
#> [25] "s9a04"  "s9a05"  "s9a06"  "s9a07"  "s10a01" "s10a02" "s10a03" "s10a04"
#> [33] "s10a05" "s10a06" "s10a07"
write.FCS(gg, "gg.fcs")
#> [1] "gg.fcs"
(ggg = read.FCS("gg.fcs"))
#> flowFrame object 'gg.fcs'
#> with 585644 cells and 9 observables:
#>         name              desc range minRange maxRange
#> $P1    FSC-H        FSC-Height  1023        0     1022
#> $P2    SSC-H        SSC-Height  1023        0     1022
#> $P3    FL1-H         CD15 FITC 10000        0     9999
#> $P4    FL2-H           CD45 PE 10000        0     9999
#> $P5    FL3-H        CD14 PerCP 10000        0     9999
#> $P6    FL2-A              <NA>  1023        0     1022
#> $P7    FL4-H          CD33 APC 10000        0     9999
#> $P8     Time Time (51.20 sec.)   999        0      998
#> $P9 Original    Original Frame    35        0       34
#> 80 keywords are stored in the 'description' slot
summary(ggg)
#>             FSC-H     SSC-H        FL1-H      FL2-H        FL3-H     FL2-A
#> Min.      52.0000    0.0000     1.000000     1.0000     1.000000    0.0000
#> 1st Qu.   88.0000   67.0000     2.111231   227.9141     1.626087   56.0000
#> Median   130.0000  101.0000     6.744263   447.7405     4.222901  112.0000
#> Mean     166.5125  129.6283   226.508527   567.4725    26.534154  139.5429
#> 3rd Qu.  214.0000  154.0000    50.674541   768.4763    11.680157  185.0000
#> Max.    1023.0000 1023.0000 10000.000000 10000.0000 10000.000000 1023.0000
#>                FL4-H     Time Original
#> Min.        1.000000   0.0000  1.00000
#> 1st Qu.     1.878040  71.0000 15.00000
#> Median      5.010745 191.0000 20.00000
#> Mean       21.662928 260.2665 21.04099
#> 3rd Qu.    26.740835 375.0000 27.00000
#> Max.    10000.000000 999.0000 35.00000
table(exprs(ggg[,9]))
#> 
#>     1     2     3     4     5     6     7     8     9    10    11    12    13 
#>  3420  3405  3435  8550 10410  3750 13979  2205 13350 11610 13830 13934 19470 
#>    14    15    16    17    18    19    20    21    22    23    24    25    26 
#>  9405 16055 10305 30195 59849 39068 11939 13214 12666  8265  4065  7080 66105 
#>    27    28    29    30    31    32    33    34    35 
#> 49421 12768 17289  4530  6765  9540 25515 24656 25601
#keyword(ggg)
(ggg = read.FCS("gg.fcs", truncate_max_range = FALSE, transformation = FALSE))
#> flowFrame object 'gg.fcs'
#> with 585644 cells and 9 observables:
#>         name              desc range minRange maxRange
#> $P1    FSC-H        FSC-Height  1023        0     1022
#> $P2    SSC-H        SSC-Height  1023        0     1022
#> $P3    FL1-H         CD15 FITC 10000        0     9999
#> $P4    FL2-H           CD45 PE 10000        0     9999
#> $P5    FL3-H        CD14 PerCP 10000        0     9999
#> $P6    FL2-A              <NA>  1023        0     1022
#> $P7    FL4-H          CD33 APC 10000        0     9999
#> $P8     Time Time (51.20 sec.)   999        0      998
#> $P9 Original    Original Frame    35        0       34
#> 61 keywords are stored in the 'description' slot
summary(ggg)
#>             FSC-H     SSC-H        FL1-H      FL2-H        FL3-H     FL2-A
#> Min.      52.0000    0.0000     1.000000     1.0000     1.000000    0.0000
#> 1st Qu.   88.0000   67.0000     2.111231   227.9141     1.626087   56.0000
#> Median   130.0000  101.0000     6.744263   447.7405     4.222901  112.0000
#> Mean     166.5125  129.6283   226.508527   567.4725    26.534154  139.5429
#> 3rd Qu.  214.0000  154.0000    50.674541   768.4763    11.680157  185.0000
#> Max.    1023.0000 1023.0000 10000.000000 10000.0000 10000.000000 1023.0000
#>                FL4-H     Time Original
#> Min.        1.000000   0.0000  1.00000
#> 1st Qu.     1.878040  71.0000 15.00000
#> Median      5.010745 191.0000 20.00000
#> Mean       21.662928 260.2665 21.04099
#> 3rd Qu.    26.740835 375.0000 27.00000
#> Max.    10000.000000 999.0000 35.00000
table(exprs(ggg[,9]))
#> 
#>     1     2     3     4     5     6     7     8     9    10    11    12    13 
#>  3420  3405  3435  8550 10410  3750 13979  2205 13350 11610 13830 13934 19470 
#>    14    15    16    17    18    19    20    21    22    23    24    25    26 
#>  9405 16055 10305 30195 59849 39068 11939 13214 12666  8265  4065  7080 66105 
#>    27    28    29    30    31    32    33    34    35 
#> 49421 12768 17289  4530  6765  9540 25515 24656 25601
#keyword(ggg)
sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
#> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=French_France.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] flowCore_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6        matrixStats_0.56.0  RProtoBufLib_2.0.0 
#>  [4] digest_0.6.25       stats4_4.0.0        magrittr_1.5       
#>  [7] evaluate_0.14       RcppParallel_5.0.0  highr_0.8          
#> [10] rlang_0.4.5         stringi_1.4.6       rmarkdown_2.1      
#> [13] tools_4.0.0         stringr_1.4.0       Biobase_2.48.0     
#> [16] parallel_4.0.0      xfun_0.13           yaml_2.2.1         
#> [19] compiler_4.0.0      BiocGenerics_0.34.0 cytolib_2.0.0      
#> [22] htmltools_0.4.0     knitr_1.28

Created on 2020-05-07 by the reprex package (v0.3.0)

SamGG commented 3 years ago

I think it's OK. I will reopen if needed. Thanks.