RGLab / flowWorkspace


allow for re-sizing of the data matrix at the level of `cytoframe::exprs<-` or `CytoFrameView::set_data()` #321

Open mikejiang opened 4 years ago

mikejiang commented 4 years ago

Specifically, I think that we will have to allow for re-sizing of the data matrix at the level of cytoframe::exprs<- or CytoFrameView::set_data(). But as these are not frequently used APIs, they can be made lower priority.

Originally posted by @jacobpwagner in https://github.com/RGLab/flowWorkspace/issues/275#issuecomment-454096843

mikejiang commented 4 years ago

This requires some more careful thinking/discussion, since cytoframe (i.e. the CytoFrameView class at the C++ level) currently prohibits changing the size of the matrix independently of the metadata, because of the consistency checks between the metadata (i.e. channels/markers) and the data.

jacobpwagner commented 4 years ago

Related to this, I just added a method to append new named columns to the data matrix and appropriately update keywords and param metadata (for MemCytoFrame): https://github.com/RGLab/cytolib/commit/b7dd39e4cde275770593c4524908ae06cd500bc9

This was originally motivated by the need to re-add uncompensated columns when parsing FlowJo workspaces: https://github.com/RGLab/CytoML/issues/89

However, since it was added, I switched flowWorkspace::cf_append_cols over to using this when the cytoframe is in-memory as opposed to h5: https://github.com/RGLab/flowWorkspace/commit/e87fbdfa3508a44f14ece397130944213713ee03

Rather than adding more of these mutators slowly as they become necessary, it is probably a good idea to work through adding the following now:

I will track any work on that front on this issue.

Along with these, we need to decide consistently how instrument ranges are set for newly added columns, and how $Pn indices are handled when columns are added or removed. flowWorkspace::cytoframe objects point to cytolib::CytoFrameView objects and so can always represent a windowed subset. Thus, we need to decide whether to realize the view to the original cytolib::CytoFrame when methods like cf_append_cols are called, as this also determines where the new indices start under the index "bump-down" logic of cytolib::CytoFrame::subset_parameters: https://github.com/RGLab/cytolib/blob/6e903fb3de917fa35f71997ba9e97c6d4f0cee78/src/CytoFrame.cpp#L296-L312
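To make the index question concrete, here is a minimal base R sketch of what a "bump-down" of index-tagged keys could look like. It is not a port of cytolib::CytoFrame::subset_parameters: the helper name bump_down_keys, the keyword list, and the restriction to plain "$Pn<suffix>" keys are all assumptions made for illustration.

```r
# Hypothetical helper (not the cytolib implementation): renumber "$Pn<suffix>"
# keys after keeping only the parameters at the positions listed in `keep`.
bump_down_keys <- function(keywords, keep) {
  keys <- names(keywords)
  # Only plain index-tagged keys such as "$P3N" or "$P12R" are handled here.
  pat <- "^\\$P([0-9]+)([A-Z]+)$"
  tagged <- grepl(pat, keys)
  old_idx <- as.integer(sub(pat, "\\1", keys[tagged]))
  suffix <- sub(pat, "\\2", keys[tagged])
  # New index = position of the old index within `keep`; NA if dropped.
  new_idx <- match(old_idx, keep)
  keys[tagged] <- paste0("$P", new_idx, suffix)
  names(keywords) <- keys
  # Discard the keys that belonged to dropped parameters.
  drop <- tagged
  drop[tagged] <- is.na(new_idx)
  keywords[!drop]
}

# Made-up keywords for a three-parameter frame.
kw <- list(`$P1N` = "FSC-A", `$P2N` = "SSC-A", `$P3N` = "Time",
           `$P3R` = "262144", `$TOT` = "100")
# A view that keeps parameters 1 and 3.
kw_view <- bump_down_keys(kw, keep = c(1, 3))
names(kw_view)
```

Under this sketch a column appended to the two-parameter view would become $P3, whereas appending to the original three-parameter frame would make it $P4. That difference is the decision discussed above.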

https://github.com/RGLab/flowCore/pull/158 shows some potential complications that can arise if assumptions are not consistent for index-tagged keys.

Currently, I have set cf_append_cols to append to a copy generated from realize_view to match the prior behavior: https://github.com/RGLab/flowWorkspace/blob/d0cb5abaff43b1db9df1c5b2ecb99f1691ddfafb/R/cytoframe.R#L835-L838. But as cytoframe is supposed to be a reference data structure, we should probably eventually change this to realize the view to the original cytolib::CytoFrame and then append the columns.

jacobpwagner commented 4 years ago

I have included example results of appending columns to a flowFrame, MemCytoFrame, and H5CytoFrame after https://github.com/RGLab/flowWorkspace/commit/e87fbdfa3508a44f14ece397130944213713ee03. The prior logic of cf_append_cols (which is still the logic for H5CytoFrame) is:

1) Convert the cytoframe to a flowFrame.
2) Add the columns in R using fr_append_cols, which also does some pdata and keyword updates based on the new columns.
3) Convert it back using flowFrame_to_cytoframe, which writes the flowFrame out to FCS (using write.FCS) and then reads it back in with load_cytoframe_from_fcs.
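The three steps above can be sketched as a small wrapper. This is an illustration of the round trip rather than the actual body of cf_append_cols: the wrapper name append_cols_via_flowframe is hypothetical, and it assumes that cytoframe_to_flowFrame and flowFrame_to_cytoframe are exported by flowWorkspace and fr_append_cols by flowCore. Defining the function needs no packages; calling it requires both to be installed.

```r
# Hypothetical wrapper illustrating the flowFrame round trip; not the
# package's own implementation of cf_append_cols.
append_cols_via_flowframe <- function(cf, cols) {
  # 1) Convert the cytoframe to a flowFrame.
  fr <- flowWorkspace::cytoframe_to_flowFrame(cf)
  # 2) Append the named columns in R; this also updates pdata and keywords.
  fr <- flowCore::fr_append_cols(fr, cols)
  # 3) Convert back to a cytoframe (a write/read round trip through FCS,
  #    as described above).
  flowWorkspace::flowFrame_to_cytoframe(fr)
}
```

Because step 3 goes through an FCS file on disk, this path copies the data rather than modifying the cytoframe in place.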

In moving the column-addition logic down to cytolib, I generally mimicked the logic of fr_append_cols, which is apparent from the results. The main difference is that I added in the $PnR keyword following the logic of https://github.com/RGLab/flowCore/issues/187 and https://github.com/RGLab/flowCore/commit/654f0c36e61ebc197bd79501761318e174c67cf7.

> library(flowCore)
> library(flowWorkspace)
> 
> fcs_path <- system.file("extdata", "CytoTrol_CytoTrol_1.fcs", package = "flowWorkspaceData")
> cf <- load_cytoframe_from_fcs(fcs_path)
> 
> cf_h5 <- load_cytoframe_from_fcs(fcs_path, is_h5 = TRUE)
> 
> fr <- read.FCS(fcs_path)
> 
> 
> # Just append a copy of a few columns with a new name
> to_append <- exprs(cf)[,c(3,4)]
> 
> colnames(to_append) <- c("new_col_1", "new_col_2")
> cf <- cf_append_cols(cf, to_append)
> cf_h5 <- cf_append_cols(cf_h5, to_append)
> fr_expanded <- fr_append_cols(fr, to_append)
> 
> cf
cytoframe object 'CytoTrol_CytoTrol_1.fcs'
with 119531 cells and 14 observables:
          name         desc    range minRange maxRange
$P1      FSC-A         <NA> 262143.0     0.00 262143.0
$P2      FSC-H         <NA> 262143.0     0.00 262143.0
$P3      FSC-W         <NA> 262143.0     0.00 262143.0
$P4      SSC-A         <NA> 262143.0     0.00 262143.0
$P5     B710-A  CD4 PcpCy55 262143.0  -111.00 262143.0
$P6     R660-A     CD38 APC 262143.0  -111.00 262143.0
$P7     R780-A    CD8 APCH7 262143.0  -111.00 262143.0
$P8     V450-A     CD3 V450 262143.0  -111.00 262143.0
$P9     V545-A  HLA-DR V500 262143.0  -111.00 262143.0
$P10    G560-A      CCR7 PE 262143.0  -111.00 262143.0
$P11    G780-A CD45RA PECy7 262143.0  -111.00 262143.0
$P12      Time         <NA> 262143.0     0.00 262143.0
$P13 new_col_1         <NA> 202467.1 63886.93 202467.1
$P14 new_col_2         <NA> 262143.0  3622.92 262143.0
207 keywords are stored in the 'description' slot
> cf_h5
cytoframe object 'file34e0444a0fa6'
with 119531 cells and 14 observables:
          name         desc  range minRange maxRange
$P1      FSC-A         <NA> 262143        0   262143
$P2      FSC-H         <NA> 262143        0   262143
$P3      FSC-W         <NA> 262143        0   262143
$P4      SSC-A         <NA> 262143        0   262143
$P5     B710-A  CD4 PcpCy55 262143     -111   262143
$P6     R660-A     CD38 APC 262143     -111   262143
$P7     R780-A    CD8 APCH7 262143     -111   262143
$P8     V450-A     CD3 V450 262143     -111   262143
$P9     V545-A  HLA-DR V500 262143     -111   262143
$P10    G560-A      CCR7 PE 262143     -111   262143
$P11    G780-A CD45RA PECy7 262143     -111   262143
$P12      Time         <NA> 262143        0   262143
$P13 new_col_1         <NA> 202468        0   202468
$P14 new_col_2         <NA> 262143        0   262143
207 keywords are stored in the 'description' slot
> fr_expanded
flowFrame object '7817b649-f92d-4103-bd46-6364fdbe85db'
with 119531 cells and 14 observables:
          name         desc    range minRange maxRange
$P1      FSC-A         <NA> 262144.0     0.00 262143.0
$P2      FSC-H         <NA> 262144.0     0.00 262143.0
$P3      FSC-W         <NA> 262144.0     0.00 262143.0
$P4      SSC-A         <NA> 262144.0     0.00 262143.0
$P5     B710-A  CD4 PcpCy55 262144.0  -111.00 262143.0
$P6     R660-A     CD38 APC 262144.0  -111.00 262143.0
$P7     R780-A    CD8 APCH7 262144.0  -111.00 262143.0
$P8     V450-A     CD3 V450 262144.0  -111.00 262143.0
$P9     V545-A  HLA-DR V500 262144.0  -111.00 262143.0
$P10    G560-A      CCR7 PE 262144.0  -111.00 262143.0
$P11    G780-A CD45RA PECy7 262144.0  -111.00 262143.0
$P12      Time         <NA> 262144.0     0.00 262143.0
$P13 new_col_1         <NA> 138581.2 63886.93 202467.1
$P14 new_col_2         <NA> 258521.1  3622.92 262143.0
198 keywords are stored in the 'description' slot
> range(cf, "instrument")
     FSC-A  FSC-H  FSC-W  SSC-A B710-A R660-A R780-A V450-A V545-A G560-A G780-A   Time new_col_1 new_col_2
min      0      0      0      0   -111   -111   -111   -111   -111   -111   -111      0  63886.93   3622.92
max 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 202467.09 262143.00
> range(cf_h5, "instrument")
     FSC-A  FSC-H  FSC-W  SSC-A B710-A R660-A R780-A V450-A V545-A G560-A G780-A   Time new_col_1 new_col_2
min      0      0      0      0   -111   -111   -111   -111   -111   -111   -111      0         0         0
max 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143    202468    262143
> range(fr_expanded, "instrument")
     FSC-A  FSC-H  FSC-W  SSC-A B710-A R660-A R780-A V450-A V545-A G560-A G780-A   Time new_col_1 new_col_2
min      0      0      0      0   -111   -111   -111   -111   -111   -111   -111      0  63886.93   3622.92
max 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 262143 202467.09 262143.00
> 
> range(cf, "data")
     FSC-A  FSC-H     FSC-W     SSC-A   B710-A    R660-A    R780-A V450-A   V545-A   G560-A    G780-A    Time new_col_1 new_col_2
min  24601  25008  63886.93   3622.92   -352.8   -714.95  -1133.65   -171   -264.1   -278.4   -435.84     0.2  63886.93   3622.92
max 262143 258223 202467.09 262143.00 262143.0 262143.00 262143.00 262143 262143.0 262143.0 262143.00 29539.2 202467.09 262143.00
> range(cf_h5, "data")
     FSC-A  FSC-H     FSC-W     SSC-A   B710-A    R660-A    R780-A V450-A   V545-A   G560-A    G780-A    Time new_col_1 new_col_2
min  24601  25008  63886.93   3622.92   -352.8   -714.95  -1133.65   -171   -264.1   -278.4   -435.84     0.2  63886.93   3622.92
max 262143 258223 202467.09 262143.00 262143.0 262143.00 262143.00 262143 262143.0 262143.0 262143.00 29539.2 202467.09 262143.00
> range(fr_expanded, "data")
     FSC-A  FSC-H     FSC-W     SSC-A   B710-A    R660-A    R780-A V450-A   V545-A   G560-A    G780-A    Time new_col_1 new_col_2
min  24601  25008  63886.93   3622.92   -352.8   -714.95  -1133.65   -171   -264.1   -278.4   -435.84     0.2  63886.93   3622.92
max 262143 258223 202467.09 262143.00 262143.0 262143.00 262143.00 262143 262143.0 262143.0 262143.00 29539.2 202467.09 262143.00
> 
> keyword(cf, "$P13R")
$`$P13R`
[1] "202469"

> keyword(cf_h5, "$P13R")
$`$P13R`
[1] "202469"

> keyword(fr_expanded, "$P13R")
$`$P13R`
NULL

> 
> keyword(cf, "flowCore_$P13Rmin")
$`flowCore_$P13Rmin`
NULL

> keyword(cf_h5, "flowCore_$P13Rmin")
$`flowCore_$P13Rmin`
NULL

> keyword(fr_expanded, "flowCore_$P13Rmin")
$`flowCore_$P13Rmin`
NULL

> 
> keyword(cf, "flowCore_$P13Rmax")
$`flowCore_$P13Rmax`
NULL

> keyword(cf_h5, "flowCore_$P13Rmax")
$`flowCore_$P13Rmax`
NULL

> keyword(fr_expanded, "flowCore_$P13Rmax")
$`flowCore_$P13Rmax`
NULL

It is easy to change the determination of instrument min/max assigned to the channels and default keyword values if necessary.
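For reference, the new_col_1 numbers in the output above are reproduced by two simple rules. Both rules are inferred from the printed values only and are assumptions, not a statement of what cytolib or fr_append_cols actually compute: the $P13R keyword matches ceiling(max) + 1, and the flowFrame "range" column matches max - min + 1.

```r
# Data range of new_col_1, copied from the range(cf, "data") output above.
data_min <- 63886.93
data_max <- 202467.09

# Assumed rule for the $PnR keyword: smallest integer >= max, plus one.
pnr <- ceiling(data_max) + 1

# Assumed rule for the flowFrame "range" column: width of the data range plus one.
ff_range <- data_max - data_min + 1

pnr                  # compare with keyword(cf, "$P13R")
round(ff_range, 1)   # compare with the range column of fr_expanded for $P13
```

If a different convention is preferred for the instrument min/max of appended columns, only these two expressions would need to change in a sketch like this.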