Open baj12 opened 4 years ago
It seems I have to do a deep copy using realize_view:
` pop2Dat = realize_view(popDat, tempfile(pattern="popData"))
cf_append_cols(pop2Dat, mm)
exprs(popDat) `
works. Is this the way to go? Thx for the clarification.
B
Correct, cf_append_cols
will directly modify the cytoframe
passed to it, so depending on your use it is probably best to make a deep copy first as you mention. That said, I am still looking in to the source of your error and will update this issue with what I find.
@baj12 , your issue actually exposed an issue with read-only protection of cytoframes with on-disk backends, so thanks for that. That is fixed by https://github.com/RGLab/cytolib/commit/5afbec8cabdf012b08f19857d440ed139f0c5d82, so now your example will appropriately be blocked from modifying the data in flowWorkspaceData
.
> library(flowCore)
> library(flowWorkspace)
>
> path <- system.file("extdata",package="flowWorkspaceData")
> gs_path <- list.files(path, pattern = "gs_manual",full = TRUE)
> gs <- load_gs(gs_path)
>
> popDat <- gh_pop_get_data(gs[[1]], "CD3+")
> dat <- exprs(popDat)
> mm <- matrix(ncol=1, 1:nrow(popDat), dimnames = list(NULL,"idx"))
> # Now appropriately guarded
> cf_append_cols(popDat, mm)
Error in append_cols(cf@pointer, colnames(cols), cols) :
Can't write to the read-only H5CytoFrame object!
You are right that to append columns to a cytoframe, you will have to do one of two things:
1) realize_view
to get a deep copy
2) If you are sure it is okay to directly modify the on-disk backend files for a cytoframe
, you can use cf_unlock
to remove read-only protection on it or cs_unlock
to remove read-only protection for the entire cytoset
.
Here is an example of how to use the second approach
> library(flowCore)
> library(flowWorkspace)
As part of improvements to flowWorkspace, some behavior of
GatingSet objects has changed. For details, please read the section
titled "The cytoframe and cytoset classes" in the package vignette:
vignette("flowWorkspace-Introduction", "flowWorkspace")
> fcs_files <- list.files(system.file("extdata", package = "flowWorkspaceData"), pattern = "CytoTrol", full.names = TRUE)
>
> cs_tmp <- load_cytoset_from_fcs(fcs_files)
>
> # Simulate a saved cytoset that you might not want to automatically alter
> tmp <- tempfile()
> save_cytoset(cs_tmp, tmp)
Done
To reload it, use 'load_cytoset' function
>
> cs <- load_cytoset(tmp)
> mm <- matrix(ncol=1, 1:nrow(cs[[1]]), dimnames = list(NULL,"idx"))
> # This will be guarded and appropriately error out
> cf_append_cols(cs[[1]], mm)
Error in append_cols(cf@pointer, colnames(cols), cols) :
Can't write to the read-only H5CytoFrame object! >
> # Explicitly remove read-only guards on entire cytoset
> cs_unlock(cs)
>
> # Now this should succeed
> cf_append_cols(cs[[1]], mm)
> dat <- exprs(cs[[1]])
> head(dat)
FSC-A FSC-H FSC-W SSC-A B710-A R660-A R780-A V450-A V545-A G560-A G780-A Time idx
[1,] 140733.05 133376 69150.98 91113.96 22311.24 35576.07 14302.16 16232.649 7644.65 4113.60 12672.00 0.2 1
[2,] 26195.32 26207 65506.79 10115.28 5.04 447.93 682.56 43.700 77.90 -91.20 18.24 0.4 2
[3,] 64294.02 51594 81667.89 174620.03 371.28 851.62 -66.36 335.350 971.85 273.60 271.68 0.6 3
[4,] 128393.87 103613 81210.08 150625.44 1494.36 5672.20 2979.09 1492.450 28790.70 771.84 988.80 0.6 4
[5,] 127717.88 119616 69974.92 76954.91 2545.20 2272.83 124635.93 8608.899 4190.45 14306.88 58977.60 0.7 5
[6,] 134347.02 125651 70071.60 70116.48 23052.96 1758.54 5281.15 4849.750 2859.50 2249.28 1560.96 0.7 6
This seems to work with a cytoset/cytoframe, but what about the GatingSet and GatingHierachy? Here, I get the following error message:
> cf_unlock(gs2[[1]])
Error in cf_unlock(gs2[[1]]) : is(cf, "cytoframe") is not TRUE
gh_pop_get_data
returns a cytoframe, but somehow this pointer is then different from the pointer in the GatingHierachy.
Maybe to understand better what I really want, here my two problems:
the Time variable has a big gap in the data (don't ask why, please), but I still want to plot some markers against the time for quality assurance. Thus, I wanted to create an index variable.
I am going to run FlowSOM on a cell type (e.g. CD3+) from the manual gating and then associate the resulting clusters/SOM nodes with the individual cells. To export them back to FlowJo or to do the comparisons to the manual gating (still need to figure this one out, too).
Thus, I believe I need to work with gatingSet/GatingHierachie and not with the cytoset/frame classes.
Funny enough, when I use lapply, I don't get an error message, but when trying to access the data I get the following error:
> exprs(popDat)
HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
#000: H5Dio.c line 185 in H5Dread(): could not get a validated dataspace from file_space_id
major: Invalid arguments to routine
minor: Bad value
#001: H5S.c line 255 in H5S_get_validated_dataspace(): selection + offset not within extent
major: Dataspace
minor: Out of range
Error in cf_getData(object@pointer) : c++ exception (unknown reason)
(this is all with cytolib_2.1.18 (master, downloaded a few hours after your patch)
Just pointing out that you can bypass the need for exporting to flowJo if you use CytoExploreR to do your manual gating: https://dillonhammill.github.io/CytoExploreR/
@baj12 , first, you are right about the GatingHierarchy
/GatingSet
distinction from cytoframe
/cytoset
. cytoframe
/cytoset
keep track of the core data matrix and its metadata and are essentially the successors to flowFrame
/flowSet
. GatingHierarchy
and GatingSet
add in automatically keeping track of compensation, transformation, and gating in one data class. That cf_unlock
/cs_unlock
mechanism is there because the cytoframe
class carries an internal read-only flag for the on-disk backends to prevent accidental changes being made to the data matrix. Anyway, your call fails because you are making it on a GatingHierarchy
(which you can think of as kind of wrapping around a cytoframe
):
> cf_unlock(gs2[[1]])
Error in cf_unlock(gs2[[1]]) : is(cf, "cytoframe") is not TRUE
If you want to remove the read-only lock on the cytoframe
, you need to call it on the underlying data class (cytoframe
or cytoset
). These should both succeed.
cf_unlock(gh_pop_get_data(gs[[1]]))
cs_unlock(gs_pop_get_data(gs))
If you want to avoid this entirely, you can pass the backend_readonly=FALSE
argument when you call load_gs
or load_cytoset
to load a dataset from disk.
Regarding the lack of error when using lapply
, can you give me a reproducible example or code snippet so I can see what you mean? Also, keep in mind that if you executed your initial snippet before that permissions patch, your flowWorkspaceData
backend files are altered/corrupted, so you should devtools::install_github("RGLab/flowWorkspaceData", force = TRUE)
first to get a blank slate there.
Trying to bypass it with lapply
as you mentioned, but I'm still seeing the appropriate guard on the cytoframe
:
> library(flowCore)
> library(flowWorkspace)
> fcs_files <- list.files(system.file("extdata", package = "flowWorkspaceData"), pattern = "CytoTrol", full.names = TRUE)
> cs_tmp <- load_cytoset_from_fcs(fcs_files)
> # Simulate a saved cytoset that you might not want to automatically alter
> tmp <- tempfile()
> save_cytoset(cs_tmp, tmp)
Done
To reload it, use 'load_cytoset' function
> cs <- load_cytoset(tmp)
> mm <- matrix(ncol=1, 1:nrow(cs[[1]]), dimnames = list(NULL,"idx"))
> lapply(c(1), function(idx){
+ cf_append_cols(cs[[idx]], mm)
+ })
Error in append_cols(cf@pointer, colnames(cols), cols) :
Can't write to the read-only H5CytoFrame object!
Here is my min. example: you don't need the unlock to get this error. As you can see I already cloned the data...
devtools::install_github("RGLab/flowWorkspaceData", force = TRUE)
library(flowCore)
library(flowWorkspace)
library(openCyto)
library(ggcyto)
gs <- load_gs(system.file("extdata/gs_DC_auto", package = "flowWorkspaceData"))
gs_get_pop_paths(gs, path = "auto")
gs2 <- gs_clone(gs)
lapply(gs2, FUN = function(x){
popDat = gh_pop_get_data(x, "Live")
cf_unlock(popDat)
# popDat2 = realize_view(popDat)
mm = matrix(ncol=1, 1:nrow(popDat), dimnames = list(NULL,"idx"))
cf_append_cols(popDat, mm)
})
exprs(gh_pop_get_data(gs2[[1]], "Live"))
I see, so the error is not coming from appending columns to a cytoframe
itself, but rather from the fact that it is already part of a GatingSet
. The issue then arises from gh_pop_get_data
because the GatingSet
does not know about the newly added column. Let me think about how best to handle your situation and work on a fix.
@jacobpwagner, just wanted to confirm whether the following operations are equivalent (i.e. h5_readonly = TRUE
is the same as cs_unlock()
)?
# 1. Turn off readonly flag upon loading
gs <- load_gs("path_to_gs",
h5_readonly = FALSE)
# 2. Load and then turn off readonly flag
gs <- load_gs("path_to_gs")
cs <- gs_cyto_data(gs)
cs_unlock(cs)
gs_cyto_data(gs) <- cs
Oh I see, looks like we should use backend_readonly = TRUE
instead.
Yes, they are equivalent. And cs_unlock
does unlocking in place, so you don't need to assign it back.
Thanks @mikejiang, just wanted to make sure whether unlocking the cytoset
also unlocked it for associated GatingSet
or whether these were being treated independently.
gs tree structure is not controlled by cs lock. And since gs itself is not disk based, so it is safe to do any mutable operations to it unless you explicitly sync to disk by save_gs
Thanks for the clarification @mikejiang.
Thanks for the effort you put into this.
My first problem of adding a new data column to a GatingSet can be "solved" like this:
save_cytoset(gs, "testcyto")
cs = load_cytoset("testcyto")
cs_unlock(cs)
lapply(1:length(cs), FUN = function(x){
mm = matrix(ncol=1, 1:nrow(cs[[x]]), dimnames = list(NULL,"idx"))
cf_append_cols(cs[[x]], mm)
})
Kind of defeats the purpose of the gating set, but I am not modifying the original data, I still get the plots I want to get.
@baj12, just FYI, for the purposes of just getting indices for each row/event, there are already gh_pop_get_indices
and gh_pop_set_indices
.
> library(flowCore)
> library(flowWorkspace)
>
> gs <- load_gs(system.file("extdata", "gs_manual", package = "flowWorkspaceData"))
>
> # boolean mask of events in any given gate, including ungated ("root")
> head(gh_pop_get_indices(gs[[1]], "root"))
[1] TRUE TRUE TRUE TRUE TRUE TRUE
>
> # Which can be easily converted to numerical indices
> head(which(gh_pop_get_indices(gs[[1]], "root")))
[1] 1 2 3 4 5 6
>
> # You can pull out the indices for any subpopulation
> head(which(gh_pop_get_indices(gs[[1]], "CD3+")))
[1] 1 5 6 9 10 15
For adding a column labeling by cluster assignment from FlowSOM, you'll probably just have to make sure every row in the underlying cytoframe
has some index assigned. So for the rows not in the target population, you'd need a dummy index for no assignment or something.
gh_pop_set_indices modifies an existing population. I am not allowed to create one. This is how I think of the internal representation of the GH. Though this might also be calculated on the fly (not sure which it is).
@baj12 . First, my apologies, I found and patched one other issue, so please re-install cytolib
before trying the following.
By pointing to gh_pop_get_indices
, I just meant that you can use the row indices to build your new index column by assigning the appropriate values to the appropriate subpopulation rows. See the example below:
library(flowCore)
library(flowWorkspace)
gs <- load_gs(system.file("extdata/gs_manual", package = "flowWorkspaceData"))
gs_get_pop_paths(gs, path = "auto")
gs2 <- gs_clone(gs)
cs <- cytoset(lapply(gs2, FUN = function(x){
# This cytoframe is really a view into the original cytoframe (using the subpopulation as index)
cf_subpop <- gh_pop_get_data(x, "CD3+")
# The key is that you need to make sure there is a value for every row
# in the entire cytoframe, not just the subpopulation
cd3_pop_rows = gh_pop_get_indices(x, "CD3+")
# Make sure every row gets some value (even if you won't end up using it)
idx_vec <- rep(-1, nrow(x))
# Then add the values you need for your subpopulation (time index or cluster assignment)
idx_vec[cd3_pop_rows] <- 1:nrow(cf_subpop)
mm = matrix(ncol=1, idx_vec, dimnames = list(NULL,"idx"))
# This cytoframe has all of the rows
cf_whole <- gh_pop_get_data(x)
cf_append_cols(cf_whole, mm)
}))
gs_cyto_data(gs2) <- cs
# The subpopulation has whatever indices you gave it
head(exprs(gh_pop_get_data(gs2[[1]], "CD3+")))
# While the other rows just have the filler value
head(exprs(gh_pop_get_data(gs2[[1]])))
With output included:
> library(flowCore)
> library(flowWorkspace)
> gs <- load_gs(system.file("extdata/gs_manual", package = "flowWorkspaceData"))
>
> gs_get_pop_paths(gs, path = "auto")
[1] "root" "not debris" "singlets" "CD3+" "CD4" "CD4/38- DR+"
[7] "CD4/38+ DR+" "CD4/38+ DR-" "CD4/38- DR-" "CD4/CCR7- 45RA+" "CD4/CCR7+ 45RA+" "CD4/CCR7+ 45RA-"
[13] "CD4/CCR7- 45RA-" "CD8" "CD8/38- DR+" "CD8/38+ DR+" "CD8/38+ DR-" "CD8/38- DR-"
[19] "CD8/CCR7- 45RA+" "CD8/CCR7+ 45RA+" "CD8/CCR7+ 45RA-" "CD8/CCR7- 45RA-" "DNT" "DPT"
> gs2 <- gs_clone(gs)
>
> cs <- cytoset(lapply(gs2, FUN = function(x){
+ cf_subpop <- gh_pop_get_data(x, "CD3+")
+ cd3_pop_rows = gh_pop_get_indices(x, "CD3+")
+ idx_vec <- rep(-1, nrow(x))
+ idx_vec[cd3_pop_rows] <- 1:nrow(cf_subpop)
+ mm = matrix(ncol=1, idx_vec, dimnames = list(NULL,"idx"))
+ cf_whole <- gh_pop_get_data(x)
+ cf_append_cols(cf_whole, mm)
+ }))
>
> gs_cyto_data(gs2) <- cs
>
> head(exprs(gh_pop_get_data(gs2[[1]], "CD3+")))
FSC-A FSC-H FSC-W SSC-A <B710-A> <R660-A> <R780-A> <V450-A> <V545-A> <G560-A> <G780-A> Time idx
[1,] 140733.05 133376 69150.98 91113.96 3106.004 3302.719 2073.3540 2996.536 1795.198 2397.642 2860.906 0.002 1
[2,] 127717.88 119616 69974.92 76954.91 1296.811 1931.995 3769.8376 2749.284 1660.914 2883.802 3480.662 0.007 2
[3,] 134347.02 125651 70071.60 70116.48 3128.845 1834.073 1607.8027 2507.708 1727.784 2194.188 1001.111 0.007 3
[4,] 84330.34 77826 71013.20 66052.55 1465.256 1523.407 3702.2285 2657.550 1612.924 2800.220 3415.069 0.015 4
[5,] 101553.95 96602 68895.48 44620.80 2902.931 2458.440 482.8756 2700.848 1561.626 2759.033 3263.364 0.015 5
[6,] 87721.18 79682 72147.98 66181.08 2928.725 1382.240 1510.5111 2500.461 1501.881 2223.328 1044.804 0.020 6
>
> head(exprs(gh_pop_get_data(gs2[[1]])))
FSC-A FSC-H FSC-W SSC-A <B710-A> <R660-A> <R780-A> <V450-A> <V545-A> <G560-A> <G780-A> Time
[1,] 140733.05 133376 69150.98 91113.96 3106.0037 3302.719 2073.3540 2996.5356 1795.1975 2397.6416 2860.9060 0.002
[2,] 26195.32 26207 65506.79 10115.28 696.8729 1394.616 1401.4686 637.6926 749.2054 173.6164 672.4402 0.004
[3,] 64294.02 51594 81667.89 174620.03 1285.8862 1710.609 720.2574 1413.2173 1766.4376 1258.1320 1191.3597 0.006
[4,] 128393.87 103613 81210.08 150625.44 1925.9648 2555.242 1872.6464 1602.2124 3219.9639 1725.0957 1711.5504 0.006
[5,] 127717.88 119616 69974.92 76954.91 1296.8107 1931.995 3769.8376 2749.2837 1660.9139 2883.8020 3480.6624 0.007
[6,] 134347.02 125651 70071.60 70116.48 3128.8455 1834.073 1607.8027 2507.7080 1727.7843 2194.1882 1001.1113 0.007
idx
[1,] 1
[2,] -1
[3,] -1
[4,] -1
[5,] 2
[6,] 3
This workaround is necessary because each GatingHierarchy
uses one underlying data matrix. Subpopulations determined by gating are then represented by indexed views in to that data matrix. So, in order to properly add a column that will be seen by the GatingHierarchy
(or its containing GatingSet
) moving forward, you need to add a complete column to the data matrix.
cf_append_columns
was added with the main use case in mind being appending a column to the entire data matrix, usually before any gating operations have been performed. This subpopulation column addition within the structure of the original GatingSet
is sort of a new use case.
I should also mention that you could just make a copy of the cytoframe
or cytoset
for your subopulation and append to that. If you will not need the GatingSet
for future steps and only care about the target subpopulation, that's a fine option:
> library(flowCore)
> library(flowWorkspace)
> gs <- load_gs(system.file("extdata/gs_manual", package = "flowWorkspaceData"))
>
> gs_get_pop_paths(gs, path = "auto")
[1] "root" "not debris" "singlets" "CD3+" "CD4" "CD4/38- DR+"
[7] "CD4/38+ DR+" "CD4/38+ DR-" "CD4/38- DR-" "CD4/CCR7- 45RA+" "CD4/CCR7+ 45RA+" "CD4/CCR7+ 45RA-"
[13] "CD4/CCR7- 45RA-" "CD8" "CD8/38- DR+" "CD8/38+ DR+" "CD8/38+ DR-" "CD8/38- DR-"
[19] "CD8/CCR7- 45RA+" "CD8/CCR7+ 45RA+" "CD8/CCR7+ 45RA-" "CD8/CCR7- 45RA-" "DNT" "DPT"
> gs2 <- gs_clone(gs)
>
> # Just pull the data you need
> cf <- realize_view(gh_pop_get_data(gs2[[1]]))
>
> # And append to that
> mm = matrix(ncol=1, 1:nrow(cf), dimnames = list(NULL,"idx"))
> cf_append_cols(cf, mm)
> head(exprs(cf))
FSC-A FSC-H FSC-W SSC-A <B710-A> <R660-A> <R780-A> <V450-A> <V545-A> <G560-A> <G780-A> Time
[1,] 140733.05 133376 69150.98 91113.96 3106.0037 3302.719 2073.3540 2996.5356 1795.1975 2397.6416 2860.9060 0.002
[2,] 26195.32 26207 65506.79 10115.28 696.8729 1394.616 1401.4686 637.6926 749.2054 173.6164 672.4402 0.004
[3,] 64294.02 51594 81667.89 174620.03 1285.8862 1710.609 720.2574 1413.2173 1766.4376 1258.1320 1191.3597 0.006
[4,] 128393.87 103613 81210.08 150625.44 1925.9648 2555.242 1872.6464 1602.2124 3219.9639 1725.0957 1711.5504 0.006
[5,] 127717.88 119616 69974.92 76954.91 1296.8107 1931.995 3769.8376 2749.2837 1660.9139 2883.8020 3480.6624 0.007
[6,] 134347.02 125651 70071.60 70116.48 3128.8455 1834.073 1607.8027 2507.7080 1727.7843 2194.1882 1001.1113 0.007
idx
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
[6,] 6
That way you can use the GatingSet
for the transformation and gating operations, but you can always pull out a copy of whatever subset of data, which is similar to getting a legacy R-level flowFrame
out from a subpopulation.
@baj12 , with https://github.com/RGLab/flowWorkspace/commit/b48c8820f86328fd640271b5ab9db32a5b89c274 I added a guard to force the user to explicitly call realize_view
on a subsetted cytoframe before appending columns. This means that for the workaround I provided you, this line:
cf_whole <- gh_pop_get_data(x)
will need to change to this
cf_whole <- realize_view(gh_pop_get_data(x))
problem:
` library(flowWorkspace)
path <- system.file("extdata",package="flowWorkspaceData") gs_path <- list.files(path, pattern = "gs_manual",full = TRUE) gs <- load_gs(gs_path)
popDat = gh_pop_get_data(gs[[1]], "CD3+") dat = exprs(popDat) mm = matrix(ncol=1, 1:nrow(popDat), dimnames = list(NULL,"idx")) cf_append_cols(popDat, mm) dat2 = exprs(popDat) `
=>
` HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
000: H5Dio.c line 185 in H5Dread(): could not get a validated dataspace from file_space_id
001: H5S.c line 255 in H5S_get_validated_dataspace(): selection + offset not within extent
Error in cf_getData(object@pointer) : c++ exception (unknown reason)
`
sessionInfo: `
Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] ggcyto_1.16.0 ncdfFlow_2.34.0 BH_1.72.0-3 RcppArmadillo_0.9.900.2.0 flowCore_2.1.2
[6] ggplot2_3.3.2 CytoML_2.1.11 flowWorkspace_4.1.9
loaded via a namespace (and not attached): [1] tidyselect_1.1.0 xfun_0.16 purrr_0.3.4 lattice_0.20-41 colorspace_1.4-1 vctrs_0.3.4
[7] generics_0.0.2 stats4_4.0.2 yaml_2.2.1 base64enc_0.1-3 XML_3.99-0.5 RBGL_1.64.0
[13] rlang_0.4.7 hexbin_1.28.1 pillar_1.4.6 withr_2.2.0 glue_1.4.2 aws.s3_0.3.21
[19] Rgraphviz_2.32.0 BiocGenerics_0.34.0 RColorBrewer_1.1-2 matrixStats_0.56.0 jpeg_0.1-8.1 lifecycle_0.2.0
[25] plyr_1.8.6 zlibbioc_1.34.0 RProtoBufLib_2.1.0 munsell_0.5.0 gtable_0.3.0 cytolib_2.1.18
[31] latticeExtra_0.6-29 Biobase_2.48.0 knitr_1.29 parallel_4.0.2 curl_4.3 Rcpp_1.0.5
[37] scales_1.1.1 S4Vectors_0.26.1 jsonlite_1.7.0 RcppParallel_5.0.2 graph_1.66.0 gridExtra_2.3
[43] png_0.1-7 digest_0.6.25 dplyr_1.0.2 grid_4.0.2 tools_4.0.2 magrittr_1.5
[49] tibble_3.0.3 crayon_1.3.4 aws.signature_0.6.0 pkgconfig_2.0.3 ellipsis_0.3.1 data.table_1.13.0
[55] xml2_1.3.2 httr_1.4.2 rstudioapi_0.11 R6_2.4.1 compiler_4.0.2
`
Description: I am in the process of my old R code around FlowSOM to work with flowWorkspace and the other packages that you guys are developing. So this is probably the first of many such messages.
The data I am looking at has a big gap in the time axis, which creates a "hole" in the plots. I would like to create a new column with simply the index of the rows (ordered by the time variable if possible). To do so I wanted to add a column using cf_append_cols but then I ran into the problem above. I.e. I am creating a simple matrix and would like to add it, pretty much the same way it is described in the Help example.
Any help is very much appreciated.
Best,
Bernd