RGLab / flowWorkspace

flowWorkspace
GNU Affero General Public License v3.0
44 stars 21 forks source link

cytoset/cytoframe questions #294

Open DillonHammill opened 4 years ago

DillonHammill commented 4 years ago

Hi @mikejiang and @jacobpwagner,

I just had a couple of questions about the new cytoset reference class.

Does this mean that the plan is do away with read.ncdfFlowSet in favour of load_cytoset_from_fcs?

I have noticed that performing operations such as transformations without assignment still modifies the supplied cytoset (which makes sense). For example:

cs <- load_cytoset_from_fcs(files)
trans <- estimateLogicle(cs[[1]], "PE-A")
# no assignment
transform(cs, trans)
# cs is transformed
cs[[1]] 
> cs[[1]]
flowFrame object 'Activation_1.fcs'
with 50000 cells and 18 observables:
                  name        desc    range     minRange maxRange
$P1              FSC-A        <NA> 262143.0    0.0000000 262143.0
$P2              FSC-H        <NA> 262143.0    0.0000000 262143.0
$P3              FSC-W        <NA> 262143.0    0.0000000 262143.0
$P4              SSC-A        <NA> 262143.0    0.0000000 262143.0
$P5              SSC-H        <NA> 262143.0    0.0000000 262143.0
$P6              SSC-W        <NA> 262143.0    0.0000000 262143.0
$P7  Alexa Fluor 488-A         CD8 262143.0 -111.0000000 262143.0
$P8               PE-A         Va2      4.5    0.5353227      4.5
$P9     PE-Texas Red-A        <NA> 262143.0 -111.0000000 262143.0
$P10           7-AAD-A        CD69 262143.0 -111.0000000 262143.0
$P11          PE-Cy7-A        <NA> 262143.0 -111.0000000 262143.0
$P12 Alexa Fluor 405-A Hoechst-405 262143.0 -111.0000000 262143.0
$P13 Alexa Fluor 430-A Hoechst-430 262143.0  -77.5199966 262143.0
$P14        Qdot 605-A        <NA> 262143.0 -111.0000000 262143.0
$P15 Alexa Fluor 647-A        CD44 262143.0 -111.0000000 262143.0
$P16 Alexa Fluor 700-A         CD4 262143.0 -111.0000000 262143.0
$P17         APC-Cy7-A       CD11c 262143.0 -111.0000000 262143.0
$P18              Time        <NA> 262143.0    0.0000000 262143.0
263 keywords are stored in the 'description' slot

How do I interact with data without making these sorts of changes to the underlying data? Do I have to coerce the cytoset to a flowSet using cytoset_to_flowSet for all internal manipulations? I have noticed that cytoset_to_flowSet accepts flowSet objects and leaves them unaltered.

fs <- cytoset_to_flowSet(files)
cytoset_to_flowSet(fs)
A flowSet with 33 experiments.

An object of class 'AnnotatedDataFrame'
  rowNames: Activation_1.fcs Activation_10.fcs ...
    Activation_9.fcs (33 total)
  varLabels: name
  varMetadata: labelDescription

  column names:
  FSC-A FSC-H FSC-W SSC-A SSC-H SSC-W Alexa Fluor 488-A PE-A PE-Texas Red-A 7-AAD-A PE-Cy7-A Alexa Fluor 405-A Alexa Fluor 430-A Qdot 605-A Alexa Fluor 647-A Alexa Fluor 700-A APC-Cy7-A Time

Should I include this conversion in all of my existing flowSet methods to ensure that I am not altering the underlying data? Presumably I would have to do the same for my flowFrame methods as well using cytoframe_to_flowFrame.

As far as I can tell cytoframe objects are only ever going to appear (at the moment) when reading in a single file. As subsetting a cytoset still returns a flowFrame object. Is this correct? Presumably cytoframes will replace flowFrames eventually?

Is there a clone method available for cytoset and/or cytoframe objects?

Just out of curiosity, what are the benefits of cytosets versus flowSets or ncdfFlowSets?

Sorry for all the questions.

Much appreciated,

Dillon

DillonHammill commented 4 years ago

Aaaah, everything makes sense now...

Is there a reason why subsetting a cytoset with negative integers does not work?

> # cs is a cytoset with 6 samples
> cs[-6]
Error in subset_cytoset(x@pointer, i, j) : index error
> traceback()
4: stop(list(message = "index error", call = subset_cytoset(x@pointer, 
       i, j), cppstack = list(file = "", line = -1L, stack = "C++ stack not available on this system")))
3: subset_cytoset(x@pointer, i, j)
2: cs[-6]
1: cs[-6]

Dillon

jacobpwagner commented 4 years ago

@DillonHammill , not really a good reason for it not working other than it not being written. This should be an easy addition and I'll get on it.

jacobpwagner commented 4 years ago

294 adds the option for negative subscripts. With regard to your original question about information, I'm in the process of a overhauling the documentation on the core packages and organizing them in website form. For now though, a lot of your questions are covered in the section I added to the flowWorkspace-Introduction vignette (under "The cytoframe and cytoset classes") with some examples. But I can address your questions briefly here:

  1. read.ncdfFlowset will still be available, but most of our code will be switching towards using the cytoframe and cytoset classes.
  2. Admittedly, switching to using the reference classes, cytoframe and cytoset, instead of flowFrame and flowSet, will take some adjustment. If you wish to avoid any changes to the original cytoframe objects when performing operations, you can always make a deep copy before (see that section in the vignette). Although it should work as a short-term solution, I would recommend avoiding adding cytoframe_to_flowFrame and cytoset_to_flowSet everywhere and instead refactoring a bit to only perform full copies where it is truly necessary. Generally avoiding costly and unnecessary copies by default makes for more efficient operations, which was part of the rationale for this change.
  3. Yes, the default for [[ extraction from a cytoset is currently still a flowFrame, but it first goes through a cytoframe and then cytoFrame_to_flowFrame. This is mostly due to compatibility with the current state of openCyto, but the default return type will probably switch to cytoframe as soon as we can implement necessary downstream changes. If you want to get ahead of that as you rework your code, sticking with the cytoframe is an option with returnType, e.g. cs[[1, returnType = "cytoframe"]] or with get_cytoframe_from_cs().
  4. For a clone/deep copy, you use realize_view(cf/cs) (we could maybe make a deep_copy alias). For a shallow copy, it's just shallow_copy(cf/cs).
  5. With regard to your question about benefits of cytoframe/cytoset versus ncdfFlowset, one of the important aspects is building out a consistent abstraction layer that can handle multiple backend storage modes so it will be extensible beyond HDF5.

Happy to answer more questions and apologies for the headaches these changes may cause for CytoExploreR, but it will be better in the long run.

DillonHammill commented 4 years ago

Thanks @jacobpwagner.

That's OK. I will work through making the necessary changes so that CytoExploreR is on par the RGLab suite and then I will re-submit to ROpenSci. The main difference for me is that the underlying data is changed whenever manipulations are made.

Just to clarify, is it better to make a deep or shallow copy of the data within functions? For example, in cases where I am manipulating data for visualisations but not returning the data, is it better to make a deep or shallow copy of the data? Presumably, in cases where the data is returned a deep copy is preferred?

One more question about deep/shallow copies. See below:

# cs is a cytoset
# make shallow copy
cs_shallow_copy <- shallow_copy(cs)
# assign shallow copy to cs
cs <- cs_shallow_copy

Presumably, assignment of shallow copy to cs does not replace the deep copy with the shallow one, but instead replaces the definition of cs with the shallow copy?

I will have a go at trying the negative subsetting later today.

Thanks again!

Dillon

mikejiang commented 4 years ago

To avoid the confusion, shallow_copy is removed.

#reference 
cs1 <- cs

# views
cs2 <- cs[i, j] #subview
cs3 <- cs[, ] #full view

#copy 
cs4 <- realize_view(cs)

See the updated doc https://github.com/RGLab/flowWorkspace/blob/trunk/vignettes/flowWorkspace-Introduction.Rmd#L404-L425

DillonHammill commented 4 years ago

Thanks @mikejiang, so any changes to the view will alter the original cytoset, unless a deep copy is made with realize_view().

DillonHammill commented 4 years ago

Sorry to bug you again @jacobpwagner. Is it possible to add negative channel subsetting for cytosets as well?

# Remove first channel
> cs[, -1]
> Error in copy_view_cytoset(x@pointer) : external pointer is not valid
jacobpwagner commented 4 years ago

Sure. Sorry, I sort of meant to add that as well in the earlier commit and then didn't get around to it. I'll take care of it.

jacobpwagner commented 4 years ago

Just added negative column subsetting support in fd10e742addfe73c92861bccd9dd0b6394efec9b. And no worries about "bugging". It's all really helpful, actually.

DillonHammill commented 4 years ago

Thanks @jacobpwagner, I will let you know if I encounter any other issues.

DillonHammill commented 4 years ago

Is there any way to ship cytoset or GatingSet objects in a data package? My understanding is that the cytoset needs to be constructed locally so that it can find the location on disk. I could never get this to work, just wondering if it is possible?

gfinak commented 4 years ago

Only if you ship the files on disk.

DillonHammill commented 4 years ago

Thanks @gfinak, how would I go about doing this? Using DataPackageR I currently construct a flowSet object which is shipped in the package.

gfinak commented 4 years ago

Probably in inst/extdata and then use system.file() to locate the data in the installed package and load it.

DillonHammill commented 4 years ago

This is what I am currently doing:

files <- list.files(path = DataPackageR::project_extdata_path('Activation'),
                              full.names = TRUE)
Activation <- read.flowSet(files = files)

So replace list.files with system.file call and switch to load_cytoset_from_fcs?

gfinak commented 4 years ago

I think, yes.

DillonHammill commented 4 years ago

Thanks @gfinak I will give it a try.

DillonHammill commented 4 years ago

The cytoset builds fine in DataPackageR but does not open when called after loading the data package...

files <- list.files(system.file("extdata/Activation", package = "CytoExploreRData"), full.names = TRUE)
Activation <- load_cytoset_from_fcs(files)
library(CytoExploreRData)
Activation
Error in get_pheno_data(object@pointer) : external pointer is not valid
DillonHammill commented 4 years ago

Sorry closed by accident. I do have some additional pData() assignments, I will try removing these to see if that works...

jacobpwagner commented 4 years ago

Yeah, see if that fixes it. Just those two lines above complete successfully in my testing.

DillonHammill commented 4 years ago

Still no luck with just these lines...

files <- list.files(system.file("extdata/Activation", package = "CytoExploreRData"), 
full.names = TRUE)
Activation <- load_cytoset_from_fcs(files)
library(CytoExploreRData)
Activation
Error in get_pheno_data(object@pointer) : external pointer is not valid
DillonHammill commented 4 years ago

I have pushed the changes to the devel branch of CytoExploreRData if you would like to try it. The Activation preprocessing script can be located in data-raw as Activation.Rmd.

jacobpwagner commented 4 years ago

Weird. Still cannot reproduce it:

> library(CytoExploreR)
Loading required package: flowCore
Loading required package: flowWorkspace
As part of improvements to flowWorkspace, some behavior of
GatingSet objects has changed. For details, please read the section
titled "The cytoframe and cytoset classes" in the package vignette:

  vignette("flowWorkspace-Introduction", "flowWorkspace")
Loading required package: openCyto
Loading required package: ncdfFlow
Loading required package: RcppArmadillo
Loading required package: BH
> library(CytoExploreRData)
> files <- list.files(system.file("extdata/Activation", package = "CytoExploreRData"), 
+                     full.names = TRUE)
> Activation <- load_cytoset_from_fcs(files)
> Activation
A cytoset with 33 samples.

  column names:
    FSC-A, FSC-H, FSC-W, SSC-A, SSC-H, SSC-W, Alexa Fluor 488-A, PE-A, PE-Texas Red-A, 7-AAD-A, PE-Cy7-A, Alexa Fluor 405-A, Alexa Fluor 430-A, Qdot 605-A, Alexa Fluor 647-A, Alexa Fluor 700-A, APC-Cy7-A, Time

The library(CytoExploreR) and library(CytoExploreRData) lines aren't necessary, but I wanted to make sure it wasn't some sort of weird namespace issue.

jacobpwagner commented 4 years ago

Ah, I see. I was going with your first lines. You just mean this:

> library(CytoExploreRData)
> Activation
Loading required package: flowWorkspace
As part of improvements to flowWorkspace, some behavior of
GatingSet objects has changed. For details, please read the section
titled "The cytoframe and cytoset classes" in the package vignette:

  vignette("flowWorkspace-Introduction", "flowWorkspace")
 Error in get_pheno_data(object@pointer) : external pointer is not valid 
DillonHammill commented 4 years ago

That's correct, the cytoset should be loaded with just the above calls. It seems like the dataset should be constructed when the package is loaded. I think it is trying to load a pre-built cytoset and that is why it is having problems.

mikejiang commented 4 years ago

save_cytoset and load_cytoset are added. see https://github.com/RGLab/flowWorkspace/blob/trunk/vignettes/flowWorkspace-Introduction.Rmd#L467-L473

DillonHammill commented 4 years ago

Thanks @mikejiang.

I thought I might be able to do the following in data-raw/Activation.Rmd. First create the cytoset, save it using save_cytoset. Comment out the code to construct the cytoset and add a line to load the stored cytoset to object called Activation. When building the dataset this will get saved to Activation.rda which will be loaded with the package. This approach still leads to the same error...

I would like to export the cytoset object so that no paths are required to construct the cytoset when loading the data package.

DillonHammill commented 4 years ago

Just a comment on this that I will try when I get the chance. I think the issue is stemming from a lack of access privileges to files when the R package library is located in a restricted directory (such as Program Files).

DillonHammill commented 3 years ago

@mikejiang, I was hoping that you could clarify whether subsetted cytoframes are duplicated on disk or whether they point to the original copy?

In the example below, do cf and cf_sample point to the same data on disk or are there separate copies made for the cf and cf_sample? My understanding is that realize_view() is required to create the copy (right?), otherwise they both point to the same underlying data?

library(CytoExploreRData)
cs <- flowSet_to_cytoset(Activation)
cf <- cs[[1]]
cf_sample <- Subset(cf, sampleFilter(size = 5000))

I just want to make sure that I am not unnecessarily duplicating data.

Thanks for your help!

mikejiang commented 3 years ago

yes, it points to the same on-disk data unless realize it

> class(cf)
[1] "cytoframe"
attr(,"package")
[1] "flowWorkspace"
> nrow(cf)
[1] 119531
> cf_get_uri(cf)
[1] "/tmp/48aee2e9-64c1-4405-bb4e-cec0a2d3354a.h5"
> 
> cf_sample <- Subset(cf, sampleFilter(size = 5000))
> nrow(cf_sample)
[1] 5000
> cf_get_uri(cf_sample)
[1] "/tmp/48aee2e9-64c1-4405-bb4e-cec0a2d3354a.h5"
> 
> cf_cp <- realize_view(cf_sample)
> cf_get_uri(cf_cp)
[1] "/tmp/RtmpiZafH8/file12029c571b520e.h5"
DillonHammill commented 3 years ago

Thanks @mikejiang, and wrapping the subsetted cytoframe in a cytoset also points to the same data. That's really cool!

cs_new <- cytoset(
  structure(
    lapply(seq_along(cs), function(z){
      Subset(cs[[z]], sampleFilter(size = 5000))
    }),
    names = sampleNames(cs)
    )
  )
cf_get_uri(cs[[1]])
cf_get_uri(cs_new[[1]]) # same as above
nmheller commented 1 year ago

I think this error I'm getting might be related to this thread. I'm working through the "Compensation of Fluorescent Spillover" vignette in CytoExploreR but I'm not getting very far. I can't solve it because I am a beginner with R and trying to learn CytoExploreR:

[Workspace loaded from ~/R folder/CytoExploreR-Demo/.RData]

Loading required package: flowWorkspace As part of improvements to flowWorkspace, some behavior of GatingSet objects has changed. For details, please read the section titled "The cytoframe and cytoset classes" in the package vignette:

vignette("flowWorkspace-Introduction", "flowWorkspace") Error in get_cytoset(x@pointer) : external pointer is not valid

Any help would be appreciated. Thank you.

mikejiang commented 1 year ago

gaining set has to be loaded from disk via load_gs() function in order to de-serialize C++ data structure, merely restore R object from .RData won't be sufficient.

nmheller commented 1 year ago

Thank you very much for the help. I will give it a try.