Open mikejiang opened 4 years ago
This requires some more careful thinking/discussion, since currently `cytoframe` (i.e. the `CytoFrameView` class at the C++ level) prohibits changing the size of the matrix separately due to the consistency checks between metadata (i.e. channels/markers) and data.
Related to this, I just added a method to append new named columns to the data matrix and appropriately update keywords and param metadata (for `MemCytoFrame`): https://github.com/RGLab/cytolib/commit/b7dd39e4cde275770593c4524908ae06cd500bc9
This was originally motivated by the need to re-add uncompensated columns when parsing FlowJo workspaces: https://github.com/RGLab/CytoML/issues/89
However, since it was added, I switched `flowWorkspace::cf_append_cols` over to using this when the `cytoframe` is in-memory as opposed to h5: https://github.com/RGLab/flowWorkspace/commit/e87fbdfa3508a44f14ece397130944213713ee03
Rather than adding more of these mutators slowly as they become necessary, it is probably a good idea to work through adding the following now:
- `MemCytoFrame::remove_columns`
- `MemCytoFrame::append_rows` (`rbind`)
- `MemCytoFrame::remove_rows`
- `H5CytoFrame` (and `TileCytoFrame`) versions of all of the above

I will track any work on that front on this issue.
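To make the invariant behind these mutators concrete, here is a minimal sketch of the four operations on a toy in-memory container. `ToyFrame` and all of its members are hypothetical names for illustration only; this is not cytolib's `MemCytoFrame`, and it ignores keywords and ranges entirely. The only point it shows is that every resize has to keep channel names and data in step.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical toy container, NOT cytolib's MemCytoFrame: a column-major
// matrix whose channel names must always match the number of columns.
struct ToyFrame {
    std::vector<std::string> channels;         // one name per column
    std::vector<std::vector<double>> columns;  // columns[j][i] = row i, column j

    std::size_t n_cols() const { return columns.size(); }
    std::size_t n_rows() const { return columns.empty() ? 0 : columns[0].size(); }

    // Append one named column; its length must match the existing row count.
    void append_column(const std::string& name, const std::vector<double>& values) {
        if (!columns.empty() && values.size() != n_rows())
            throw std::invalid_argument("row count mismatch");
        for (const auto& c : channels)
            if (c == name) throw std::invalid_argument("duplicate channel: " + name);
        channels.push_back(name);
        columns.push_back(values);
    }

    // Remove a column by name, dropping its name and data together.
    void remove_column(const std::string& name) {
        for (std::size_t j = 0; j < channels.size(); ++j) {
            if (channels[j] == name) {
                channels.erase(channels.begin() + j);
                columns.erase(columns.begin() + j);
                return;
            }
        }
        throw std::invalid_argument("unknown channel: " + name);
    }

    // rbind-style append: `other` must have the same channels in the same order.
    // Taken by value so that appending a frame to itself is safe.
    void append_rows(ToyFrame other) {
        if (other.channels != channels)
            throw std::invalid_argument("channel mismatch");
        for (std::size_t j = 0; j < columns.size(); ++j)
            columns[j].insert(columns[j].end(),
                              other.columns[j].begin(), other.columns[j].end());
    }

    // Remove the half-open row range [first, last) from every column.
    void remove_rows(std::size_t first, std::size_t last) {
        assert(first <= last && last <= n_rows());
        for (auto& col : columns)
            col.erase(col.begin() + first, col.begin() + last);
    }
};
```

A disk-backed variant (as `H5CytoFrame` would need) could presumably expose the same four operations, but the resize cost and failure modes would differ, which is one reason to design them together.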
Along with these, we need to settle on a consistent way of setting instrument ranges for newly added columns, and of handling `$Pn` indices when columns are added or removed. `flowWorkspace::cytoframe` objects point to `cytolib::CytoFrameView` objects and so can always represent a windowed subset. Thus, we need to decide whether we should realize the view to the original `cytolib::CytoFrame` when methods like `cf_append_cols` are called, as this will also determine where the new indices start according to the index "bump-down" logic of `cytolib::CytoFrame::subset_parameters`:
https://github.com/RGLab/cytolib/blob/6e903fb3de917fa35f71997ba9e97c6d4f0cee78/src/CytoFrame.cpp#L296-L312
https://github.com/RGLab/flowCore/pull/158 shows some potential complications that can arise if assumptions are not consistent for index-tagged keys.
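As an illustration of what index "bump-down" means for index-tagged keys, the sketch below renumbers `$Pn`-style keywords after a column subset. `renumber_pn_keys` is a hypothetical helper written for this example; it is not the code at the link above and may differ from it in detail. It assumes keys of the form `$P<digits><suffix>` and deliberately leaves everything else alone.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, not cytolib code. `kept` lists the 1-based $Pn indices
// that survive a subset, in their new order. Surviving keys such as "$P5N" are
// renumbered to "$P<new>N", keys of dropped parameters are discarded, and keys
// that are not index-tagged (e.g. "$PAR", "$TOT") are copied unchanged.
// Updating "$PAR" itself is left to the caller.
std::map<std::string, std::string> renumber_pn_keys(
        const std::map<std::string, std::string>& keywords,
        const std::vector<int>& kept) {
    std::map<int, int> old_to_new;
    for (std::size_t i = 0; i < kept.size(); ++i)
        old_to_new[kept[i]] = static_cast<int>(i) + 1;

    std::map<std::string, std::string> out;
    for (const auto& kv : keywords) {
        const std::string& key = kv.first;
        // Look for "$P" followed by one or more digits and a non-empty suffix.
        std::size_t pos = 2;
        if (key.compare(0, 2, "$P") == 0) {
            while (pos < key.size() &&
                   std::isdigit(static_cast<unsigned char>(key[pos])))
                ++pos;
        }
        const bool tagged = pos > 2 && pos < key.size();
        if (!tagged) {
            out[key] = kv.second;
            continue;
        }
        const int old_index = std::stoi(key.substr(2, pos - 2));
        const auto it = old_to_new.find(old_index);
        if (it == old_to_new.end()) continue;  // parameter was removed
        out["$P" + std::to_string(it->second) + key.substr(pos)] = kv.second;
    }
    return out;
}
```

With `kept = {1, 3}`, `$P3N` becomes `$P2N` and the old `$P2N` is dropped. Prefixed keys such as `flowCore_$P13Rmin` do not start with `$P`, so this sketch copies them through with their old index, which is exactly the kind of inconsistency that has to be decided on deliberately.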
Currently, I have set `cf_append_cols` to append to a copy generated from `realize_view` to match the prior behavior:
https://github.com/RGLab/flowWorkspace/blob/d0cb5abaff43b1db9df1c5b2ecb99f1691ddfafb/R/cytoframe.R#L835-L838
But as `cytoframe` is supposed to be a reference data structure, we should probably eventually change this to realize the view to the original `cytolib::CytoFrame` and then append the columns.
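The difference between the two options can be sketched with a toy view over a shared frame. `ToyFrame`, `ToyView`, `realize_copy`, and `realize_in_place` are hypothetical names for this example only, and "realize in place" is just one reading of "realize the view to the original"; the real `CytoFrameView` also tracks rows and data, which are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for cytolib::CytoFrame and cytolib::CytoFrameView.
struct ToyFrame {
    std::vector<std::string> channels;
};

struct ToyView {
    std::shared_ptr<ToyFrame> frame;   // shared, so several handles can alias it
    std::vector<std::size_t> col_idx;  // windowed subset of frame->channels

    // Channel names visible through the window.
    std::vector<std::string> channels() const {
        std::vector<std::string> out;
        for (std::size_t j : col_idx) out.push_back(frame->channels[j]);
        return out;
    }

    // Option 1 (append to a copy): materialize the window into a brand-new
    // frame; the original frame is untouched.
    ToyView realize_copy() const {
        auto copy = std::make_shared<ToyFrame>();
        copy->channels = channels();
        ToyView v{copy, {}};
        for (std::size_t j = 0; j < copy->channels.size(); ++j)
            v.col_idx.push_back(j);
        return v;
    }

    // Option 2 (reference semantics): collapse the original frame down to the
    // window in place, so every handle sharing `frame` sees the change.
    void realize_in_place() {
        frame->channels = channels();
        col_idx.clear();
        for (std::size_t j = 0; j < frame->channels.size(); ++j)
            col_idx.push_back(j);
    }

    // Append a column to the underlying frame and expose it through the view.
    void append_column(const std::string& name) {
        frame->channels.push_back(name);
        col_idx.push_back(frame->channels.size() - 1);
    }
};
```

Under option 1 the new column would take the next index after the windowed subset of the copy, while the original keeps its old numbering. Under option 2 the original itself is renumbered, and any other view that still holds indices into it becomes stale, so that case would need its own handling.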
I have included example results of appending columns to a `flowFrame`, `MemCytoFrame`, and `H5CytoFrame` after https://github.com/RGLab/flowWorkspace/commit/e87fbdfa3508a44f14ece397130944213713ee03. The prior logic of `cf_append_cols` (which is still the logic for `H5CytoFrame`) is:
1) Convert the `cytoframe` to `flowFrame`
2) Add the columns in R using `fr_append_cols`, which also does some pdata and keyword updates based on the new columns.
3) Convert it back using `flowFrame_to_cytoframe`, which writes the `flowFrame` out to FCS (using `write.FCS`) and then reads it back in with `load_cytoframe_from_fcs`
In moving the column-addition logic down to `cytolib`, I generally mimicked the logic of `fr_append_cols`, which is apparent from the results. The main difference is that I added in the `$PnR` keyword following the logic of https://github.com/RGLab/flowCore/issues/187 and https://github.com/RGLab/flowCore/commit/654f0c36e61ebc197bd79501761318e174c67cf7.
```r
> library(flowCore)
> library(flowWorkspace)
>
> fcs_path <- system.file("extdata", "CytoTrol_CytoTrol_1.fcs", package = "flowWorkspaceData")
> cf <- load_cytoframe_from_fcs(fcs_path)
>
> cf_h5 <- load_cytoframe_from_fcs(fcs_path, is_h5 = TRUE)
>
> fr <- read.FCS(fcs_path)
>
>
> # Just append a copy of a few columns with a new name
> to_append <- exprs(cf)[,c(3,4)]
>
> colnames(to_append) <- c("new_col_1", "new_col_2")
> cf <- cf_append_cols(cf, to_append)
> cf_h5 <- cf_append_cols(cf_h5, to_append)
> fr_expanded <- fr_append_cols(fr, to_append)
>
> cf
cytoframe object 'CytoTrol_CytoTrol_1.fcs'
with 119531 cells and 14 observables:
          name         desc    range minRange maxRange
$P1      FSC-A         <NA> 262143.0     0.00 262143.0
$P2      FSC-H         <NA> 262143.0     0.00 262143.0
$P3      FSC-W         <NA> 262143.0     0.00 262143.0
$P4      SSC-A         <NA> 262143.0     0.00 262143.0
$P5     B710-A  CD4 PcpCy55 262143.0  -111.00 262143.0
$P6     R660-A     CD38 APC 262143.0  -111.00 262143.0
$P7     R780-A    CD8 APCH7 262143.0  -111.00 262143.0
$P8     V450-A     CD3 V450 262143.0  -111.00 262143.0
$P9     V545-A  HLA-DR V500 262143.0  -111.00 262143.0
$P10    G560-A      CCR7 PE 262143.0  -111.00 262143.0
$P11    G780-A CD45RA PECy7 262143.0  -111.00 262143.0
$P12      Time         <NA> 262143.0     0.00 262143.0
$P13 new_col_1         <NA> 202467.1 63886.93 202467.1
$P14 new_col_2         <NA> 262143.0  3622.92 262143.0
207 keywords are stored in the 'description' slot
> cf_h5
cytoframe object 'file34e0444a0fa6'
with 119531 cells and 14 observables:
          name         desc  range minRange maxRange
$P1      FSC-A         <NA> 262143        0   262143
$P2      FSC-H         <NA> 262143        0   262143
$P3      FSC-W         <NA> 262143        0   262143
$P4      SSC-A         <NA> 262143        0   262143
$P5     B710-A  CD4 PcpCy55 262143     -111   262143
$P6     R660-A     CD38 APC 262143     -111   262143
$P7     R780-A    CD8 APCH7 262143     -111   262143
$P8     V450-A     CD3 V450 262143     -111   262143
$P9     V545-A  HLA-DR V500 262143     -111   262143
$P10    G560-A      CCR7 PE 262143     -111   262143
$P11    G780-A CD45RA PECy7 262143     -111   262143
$P12      Time         <NA> 262143        0   262143
$P13 new_col_1         <NA> 202468        0   202468
$P14 new_col_2         <NA> 262143        0   262143
207 keywords are stored in the 'description' slot
> fr_expanded
flowFrame object '7817b649-f92d-4103-bd46-6364fdbe85db'
with 119531 cells and 14 observables:
          name         desc    range minRange maxRange
$P1      FSC-A         <NA> 262144.0     0.00 262143.0
$P2      FSC-H         <NA> 262144.0     0.00 262143.0
$P3      FSC-W         <NA> 262144.0     0.00 262143.0
$P4      SSC-A         <NA> 262144.0     0.00 262143.0
$P5     B710-A  CD4 PcpCy55 262144.0  -111.00 262143.0
$P6     R660-A     CD38 APC 262144.0  -111.00 262143.0
$P7     R780-A    CD8 APCH7 262144.0  -111.00 262143.0
$P8     V450-A     CD3 V450 262144.0  -111.00 262143.0
$P9     V545-A  HLA-DR V500 262144.0  -111.00 262143.0
$P10    G560-A      CCR7 PE 262144.0  -111.00 262143.0
$P11    G780-A CD45RA PECy7 262144.0  -111.00 262143.0
$P12      Time         <NA> 262144.0     0.00 262143.0
$P13 new_col_1         <NA> 138581.2 63886.93 202467.1
$P14 new_col_2         <NA> 258521.1  3622.92 262143.0
198 keywords are stored in the 'description' slot
> range(cf, "instrument")
     FSC-A  FSC-H  FSC-W  SSC-A B710-A R660-A R780-A V450-A V545-A G560-A G780-A   Time new_col_1 new_col_2
min      0      0      0      0   -111   -111   -111   -111   -111   -111   -111      0  63886.93   3622.92
max 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 202467.09 262143.00
> range(cf_h5, "instrument")
     FSC-A  FSC-H  FSC-W  SSC-A B710-A R660-A R780-A V450-A V545-A G560-A G780-A   Time new_col_1 new_col_2
min      0      0      0      0   -111   -111   -111   -111   -111   -111   -111      0         0         0
max 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143    202468    262143
> range(fr_expanded, "instrument")
     FSC-A  FSC-H  FSC-W  SSC-A B710-A R660-A R780-A V450-A V545-A G560-A G780-A   Time new_col_1 new_col_2
min      0      0      0      0   -111   -111   -111   -111   -111   -111   -111      0  63886.93   3622.92
max 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 202467.09 262143.00
>
> range(cf, "data")
     FSC-A  FSC-H     FSC-W     SSC-A   B710-A    R660-A    R780-A V450-A   V545-A   G560-A    G780-A    Time new_col_1 new_col_2
min  24601  25008  63886.93   3622.92   -352.8   -714.95  -1133.65   -171   -264.1   -278.4   -435.84     0.2  63886.93   3622.92
max 262143 258223 202467.09 262143.00 262143.0 262143.00 262143.00 262143 262143.0 262143.0 262143.00 29539.2 202467.09 262143.00
> range(cf_h5, "data")
     FSC-A  FSC-H     FSC-W     SSC-A   B710-A    R660-A    R780-A V450-A   V545-A   G560-A    G780-A    Time new_col_1 new_col_2
min  24601  25008  63886.93   3622.92   -352.8   -714.95  -1133.65   -171   -264.1   -278.4   -435.84     0.2  63886.93   3622.92
max 262143 258223 202467.09 262143.00 262143.0 262143.00 262143.00 262143 262143.0 262143.0 262143.00 29539.2 202467.09 262143.00
> range(fr_expanded, "data")
     FSC-A  FSC-H     FSC-W     SSC-A   B710-A    R660-A    R780-A V450-A   V545-A   G560-A    G780-A    Time new_col_1 new_col_2
min  24601  25008  63886.93   3622.92   -352.8   -714.95  -1133.65   -171   -264.1   -278.4   -435.84     0.2  63886.93   3622.92
max 262143 258223 202467.09 262143.00 262143.0 262143.00 262143.00 262143 262143.0 262143.0 262143.00 29539.2 202467.09 262143.00
>
> keyword(cf, "$P13R")
$`$P13R`
[1] "202469"
> keyword(cf_h5, "$P13R")
$`$P13R`
[1] "202469"
> keyword(fr_expanded, "$P13R")
$`$P13R`
NULL
>
> keyword(cf, "flowCore_$P13Rmin")
$`flowCore_$P13Rmin`
NULL
> keyword(cf_h5, "flowCore_$P13Rmin")
$`flowCore_$P13Rmin`
NULL
> keyword(fr_expanded, "flowCore_$P13Rmin")
$`flowCore_$P13Rmin`
NULL
>
> keyword(cf, "flowCore_$P13Rmax")
$`flowCore_$P13Rmax`
NULL
> keyword(cf_h5, "flowCore_$P13Rmax")
$`flowCore_$P13Rmax`
NULL
> keyword(fr_expanded, "flowCore_$P13Rmax")
$`flowCore_$P13Rmax`
NULL
```
It is easy to change the determination of instrument min/max assigned to the channels and default keyword values if necessary.
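For reference, one rule that reproduces the `$P13R` value printed in the session above (`"202469"` for a data maximum of 202467.09) is `ceil(max) + 1`. The sketch below encodes that rule. It is an inference from the printed output only, not confirmed cytolib or flowCore logic, and `infer_pnr` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical rule inferred from the example output, not confirmed library
// logic: $PnR = ceil(max(column)) + 1, stored as an integer string.
std::string infer_pnr(const std::vector<double>& column) {
    assert(!column.empty());
    const double max_value = *std::max_element(column.begin(), column.end());
    return std::to_string(static_cast<long long>(std::ceil(max_value)) + 1);
}
```

Applied to the min and max of `new_col_2` (3622.92 and 262143.00), the same rule gives `"262144"`, which matches the `range` value that `fr_expanded` reports for the original channels. Whichever rule is chosen, the instrument min would need a matching convention, since the in-memory and h5 results above disagree on it (63886.93 versus 0).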
Specifically, I think that we will have to allow for re-sizing of the data matrix at the level of `cytoframe::exprs<-` or `CytoFrameView::set_data()`. But as these are not frequently used APIs, they can be made lower priority.

Originally posted by @jacobpwagner in https://github.com/RGLab/flowWorkspace/issues/275#issuecomment-454096843