RGLab / flowWorkspace

flowWorkspace
GNU Affero General Public License v3.0
44 stars 21 forks source link

Replacing sampleNames of cytoset #335

Closed DillonHammill closed 4 years ago

DillonHammill commented 4 years ago

Hi @mikejiang & @jacobpwagner,

Looks like replacement of sampleNames at the cytoset level is not inherited correctly at the flowFrame/cytoframe level. The sampleNames replacement is being set as the rownames in the GUID slot, it doesn't actually replace the GUID slot entries. See example below:

> cs
A cytoset with 4 samples.

  column names:
    FSC-A, FSC-H, FSC-W, SSC-A, SSC-H, SSC-W, Alexa Fluor 488-A, PE-A, PE-Texas Red-A, 7-AAD-A, Time

> sampleNames(cs)
 [1] "Activation_1.fcs"  "Activation_2.fcs"  "Activation_3.fcs"  "Activation_4.fcs" 

> sampleNames(cs) <- paste("Test", 1:4)

> sampleNames(cs)
 [1] "Test 1"  "Test 2"  "Test 3"  "Test 4"

> identifier(cs[[1]])
[1] "Activation_1.fcs"

> keyword(cs[[1]], "GUID")
$GUID
[1] "Activation_1.fcs"

> keyword(cs, "GUID")
        GUID               
Test 1  "Activation_1.fcs" 
Test 2  "Activation_2.fcs" 
Test 3  "Activation_3.fcs" 
Test 4  "Activation_4.fcs" 

> rownames(keyword(cs, "GUID"))
 [1] "Test 1"  "Test 2"  "Test 3"  "Test 4" 

Apologies if you have already addressed this, I am still having problems with Rtools 4.0 so I am testing against the development versions of BioC packages.

mikejiang commented 4 years ago

GUID isn't a standard keyword from FCS standard and considered as the legacy information from flowCore/flowFrame. It has never been formally used in cytoverse tool chains and not related to sampleNames of cytoset in any way. (In fact, I lean toward dropping/deprecating it entirely If it is causing confusions) In what context do you need to use it in your workflow?

DillonHammill commented 4 years ago

I often have cases where the sampleNames of the cytoset require changing, it would be great if this information was passed down to the flowFrame/cytoframe in some way (i.e. sampleNames(cs)[1] == identifier(cs[[1]]). A lot of the looping operations are performed at the flowFrame level (since cs[[1]] is currently a flowFrame), this means that I use the cytoset constructor a fair bit to wrap the output. The only thing is that the flowFrame identifiers of the constructed cytoset seem to be changed to the temp file name and I cannot find a way to retain the original identifiers. I was hoping a simple sampleNames replacement would work but I guess not. Please let me know if there is a way to do this.

DillonHammill commented 4 years ago

Here is an example @mikejiang:

library(CytoExploreRData)
library(flowCore)
library(flowWorkspace)

# FLOWSET 
fs <- Activation
sampleNames(fs)
fr_list <- lapply(seq_along(fs), function(x){
  # DO SOMETHING HERE
  return(fs[[x]])
})
names(fr_list) <- sampleNames(fs)
fs_new <- flowSet(fr_list)
sampleNames(fs_new)
identifier(fs_new[[1]])

# CYTOSET
cs <- flowSet_to_cytoset(fs)
sampleNames(cs)
fr_list <- lapply(seq_along(cs), function(x){
  # DO SOMETHING HERE
  return(cs[[x]])
})
names(fr_list) <- sampleNames(cs)
# cytoset constructor requires cytoframes
cf_list <- lapply(fr_list, flowFrame_to_cytoframe)
cs_new <- cytoset(cf_list)
sampleNames(cs_new)
identifier(cs_new[[1]])

# fix the identifiers
lapply(seq_along(cs_new), function(z){
identifier(cs_new[[z]]) <- sampleNames(cs)[z]
})
identifier(cs_new[[1]])

I would like to retain the behaviour of the old flowSet method in which the original identifiers are retained. It seems like a lot more work to achieve the same result with cytosets. Perhaps I can add a custom apply function which can handle the loops and fix the identifiers - otherwise this will be a lot work to track them all down in CytoExploreR. That way if this behaviour does change in the future I only need to change it in one place.

DillonHammill commented 4 years ago

A combination of fsApply() and flowSet_to_cytoset() seems to work:

# FUN is a function to apply to each flowFrame that returns a flowFrame
cs_new <- flowSet_to_cytoset(fsApply(cs, FUN))
sampleNames(cs_new)
identifier(cs_new[[1]])

This could be wrapped into a new csApply() function:

csApply <- function(x, 
                    FUN,
                    ...){

  res <- fsApply(x, FUN = FUN, ...)

  if(is(res, "flowSet")){
    if(is(x, "cytoset")){
      return(flowSet_to_cytoset(res))
    }else{
      return(res)
    }
  }else{
    return(res)
  }

}

I admittedly don't use fsApply() much in CytoExploreR, I generally prefer to use lapply() but I think it is time that I made the change. It does make the code more concise and readable. Thanks for steering me in the right direction @mikejiang. I will implement this under cyto_apply() name for now in case you do decide to add a csApply() function.

mikejiang commented 4 years ago

sampleNames should be only stored at the container level(cytoset), I think to keep the duplicated information at the individual element level (cytoframe ) is a bad design that can easily lead to potential data discrepancy errors.

The way that flowCore maintains the sample name through GUID keyword of flowFrame a legacy behavior and should be deprecated from my opinion.

jacobpwagner commented 4 years ago

Sorry I'm coming in late to this, but @DillonHammill , is there a reason you can't just explicitly ask for a cytoframe back in the body of your lapply call?:

Using your example:

cf_list <- lapply(seq_along(cs), function(x){
  # DO SOMETHING HERE
  return(cs[[x, returnType = "cytoframe"]])
})
names(cf_list) <- sampleNames(cs)
cs_new <- cytoset(cf_list)
sampleNames(cs_new)
 [1] "Activation_1.fcs"  "Activation_2.fcs"  "Activation_3.fcs"  "Activation_4.fcs"  "Activation_5.fcs"  "Activation_6.fcs"  "Activation_7.fcs"  "Activation_8.fcs" 
 [9] "Activation_9.fcs"  "Activation_10.fcs" "Activation_11.fcs" "Activation_12.fcs" "Activation_13.fcs" "Activation_14.fcs" "Activation_15.fcs" "Activation_16.fcs"
[17] "Activation_17.fcs" "Activation_18.fcs" "Activation_19.fcs" "Activation_20.fcs" "Activation_21.fcs" "Activation_22.fcs" "Activation_23.fcs" "Activation_24.fcs"
[25] "Activation_25.fcs" "Activation_26.fcs" "Activation_27.fcs" "Activation_28.fcs" "Activation_29.fcs" "Activation_30.fcs" "Activation_31.fcs" "Activation_32.fcs"
[33] "Activation_33.fcs"
> identifier(cs_new[[1]])
[1] "Activation_1.fcs"

Basically, to respond to your point "since cs[[1]] is currently a flowFrame", it doesn't have to be. That's just default.

DillonHammill commented 4 years ago

Thanks @jacobpwagner I assumed that there was a reason why a cytoframe is not returned by default?

I frequently use the flowFrame identifiers in CytoExploreR, particularly within cyto_plot() which does almost everything at the flowFrame level. For example, identifiers are used in titles and legends. If this information does not align with the original names users can't tell which sample is which.

The only way that users know which sample is which is based on the inital file name, so think it is important that this retained in some way.

jacobpwagner commented 4 years ago

Yeah, I know it's sort of confusing. I tried to call it out in the cytoset doc methods section, but there's a lot there. The only reason a cytoframe is not returned from cytoset[[]] is for backwards compatibility with user scripts that expect to get a flowFrame back. In particular, it's a problem if they assume they can modify it without changes cascading to the original. If the reference nature of cytoframe is a problem, you can always force a copy of the cytoframe returned (which is effectively what was happening with flowSet[[]] but for a flowFrame).

DillonHammill commented 4 years ago

OK thanks that makes sense.

jacobpwagner commented 4 years ago

Also, $GUID isn't standard, but $FIL is standard (albeit technically optional). That's probably a better way to tie the samples to their initial file names if you want to grab a keyword.

DillonHammill commented 4 years ago

Yeah I thought I might do that instead as identifier preferentially checks the GUID slot but at the moment both slots contain temp file names.

@jacobpwagner do you think it is better to loop as you did instead of using the fsApply approach? I want to settle on a standard way of doing this so I can wrap it up nicely and only have to modify it in a single location.

jacobpwagner commented 4 years ago

Hey @DillonHammill . Sorry for the delay. Yeah, the lapply approach is better. Just make sure you are keeping in mind the fact that the cytoframes are references and you should be fine. Also, after some offline discussion, we will probably be switching the [[ subsetting behavior for cytoset to returning a cytoframe by default in the future.

DillonHammill commented 4 years ago

Thanks @jacobpwagner, I thought this approach would be better in the long run. I will add this to CytoExploreR and add a copy option so that I can control when the underlying data is modified. Please let me know if you do decide to drop the current GUID slot behaviour so that I can make the necessary changes.

DillonHammill commented 4 years ago

Closing this issue as it is no longer required as described in #344.