Setting pData factor levels

DillonHammill commented 4 years ago

Hi @mikejiang & @jacobpwagner ,

I am having trouble setting factor levels of pData variables using the latest flowWorkspace. I used to be able to do this:

pData(gs)$Treatment <- factor(pData(gs)$Treatment, levels = c("A","B","C","D"))
levels(pData(gs)$Treatment)
> [1] "A" "B" "C" "D"

Now the column retains the original class and is not converted to a factor.

pData(gs)$Treatment <- factor(pData(gs)$Treatment, levels = c("A","B","C","D"))
levels(pData(gs)$Treatment)
> NULL

Any idea why this is happening?

mikejiang commented 4 years ago

Because each column of pData is now simply treated/stored as string in H5 without preserving their extra attributes or types. We'll need to discuss whether we want to redesign the pData structure in h5 in order to keep these info.

DillonHammill commented 4 years ago

I certainly use it a lot to order samples by experiment variables, it would be great if we could keep this capability.

mikejiang commented 4 years ago

Do you have to use factor?Ordering doesn't necessarily require that.

DillonHammill commented 4 years ago

How would you do it? I need the ordering to be retained by the GatingSet and it should not affect the sample order.

mikejiang commented 4 years ago

I don't fully understand your intention. Please give a concrete use case showing that character isn't sufficient and factor is necessary

gfinak commented 4 years ago

@DillonHammill Default factors are not ordered. Do you mean that you use levels which returns the factors in alphabetical order? If that's the case, the equivalent would be sort(unique(x)), though if there are factor levels missing, this would not capture that.

jacobpwagner commented 4 years ago

Well, factor levels always have a default ordering, usually dictated by the values as they appear in the data.frame, right? So, I think maybe @DillonHammill wants to be able to arbitrarily change that ordering and have that be persistent, maybe for the sake of summaries or plots later.

gfinak commented 4 years ago

No, they are alphabetical:

> factor(c("B","C","A"))
[1] B C A
Levels: A B C

jacobpwagner commented 4 years ago

Oh, right, by default. For some reason I was thinking I ran in to an exception to that, but it must have been manually reordered. But anyway, if for some reason he has a method that arbitrarily reorders the levels then that would be a problem if it doesn't persist.

jacobpwagner commented 4 years ago

Basically, even if a factor is not ordered (as in the the ordered flag is not set to true and the factor levels should not be considered ordinal), default group ordering information for tables/plots/summaries can still be preserved (which may be related to the use case here). That is levels returns the underlying levels vector order, not re-alphabetizing on the fly:

> blah <- iris
> levels(blah$Species)
[1] "setosa"     "versicolor" "virginica" 
> is.ordered(blah$Species)
[1] FALSE
> table(blah$Species)

    setosa versicolor  virginica 
        50         50         50 
> levels(blah$Species) <- c("versicolor", "virginica", "setosa")
> levels(blah$Species)
[1] "versicolor" "virginica"  "setosa"
> is.ordered(blah$Species)
[1] FALSE
> table(blah$Species)

versicolor  virginica     setosa 
        50         50         50 
> blah$Species <- factor(blah$Species, levels = c("virginica", "versicolor", "setosa"))
> levels(blah$Species)
[1] "virginica"  "versicolor" "setosa"    
> is.ordered(blah$Species)
[1] FALSE
> table(blah$Species)

 virginica versicolor     setosa 
        50         50         50

But in order to preserve that behavior in the h5 we would have to add an attribute to retain the order.

DillonHammill commented 4 years ago

Thanks for looking into this, as @jacobpwagner mentioned I primarily use this for summaries and plots.

For example, in my plotting function cyto_plot users can supply a GatingSet/cytoset and request that samples be grouped or ordered by experiment variables when plotting. Users may for example want to plot the unstimulated samples in the first panel and stimulated samples in the second (this is hardly ever in alphabetical order). It much easier for users if this is taken care of behind the scenes so that they don't have to manually supply the indices to order the samples for plotting (we are normally dealing with hundreds of samples). Here is an example where samples are ordered by two variables (stimulus and antigen concentration):

cyto_plot(gs,
          parent = "CD4 T Cells",
          alias = "",
          channels = c("CD44", "CD69"),
          group_by = c("Treatment", "OVAConc")) # pData variables

Visualisations-24

I also use pData to to store experimental information which is exported with calculated statistics. Setting factor levels early on in the pipeline means that it can be used to order/group samples for plotting within CytoExploreR but also be retained in exported statistics for plotting with ggplot.

jacobpwagner commented 4 years ago

That's kind of what I suspected. @mikejiang , maybe we could add representation of the levels vectors to the h5 for the columns to match R-level factor columns. We could then check those levels vectors in the process of gs_cyto_data to mimic the R-level levels method:

levels.default <- function(x) attr(x, "levels")

of course, this starts to look a bit like we're re-inventing the wheel of data.frame.

jacobpwagner commented 4 years ago

Just documenting some of our offline discussions:

2 possible approaches:

1. The distributed approach: Expand the pData representation within each CytoFrame to handle more than simple string-string key-value mapping, thus allowing factor levels to be carried as well. That is, instead of this: https://github.com/RGLab/cytolib/blob/d0f70a2549e035f81169301b531554907f8b9c1f/inst/include/cytolib/CytoFrame.hpp#L29

typedef unordered_map<string, string> PDATA;

do something like this:

enum class pDataType {pd_int, pd_dbl, pd_str, pd_bool};
struct PDATA_VAL{
    pDataType type;
    string value; // can be cast to appropriate return type
    set<string> levels; // only non-empty for factors
};
typedef unordered_map<string, PDATA_VAL> PDATA;

This extra information would then also have to be written out to the pdata H5 DataSet

Pros:

Less code updating as this requires minimal changes to the cytolib::GatingSet and leaves in place the status quo of distributed pData
Allows access/reconstitution of metadata from distributed samples, without the need for the protocol buffer file

Cons:

Some more code updating from having to add more to the pdata DataSet in the H5 representation (more work updating H5CytoFrame) -- this is also complicated by the fact that H5 is not really designed for jagged structures with mixed data types.
Need to enforce consistency in levels and data types across cytoframes

2. The centralized approach: Add a C++ level representation of the pData data.frame to cytolib::GatingSet and save this out to the protocol buffer, maintaining one centralized store of group metadata (like existed in the R representation before the cytoset merge and still exists for flowFrame/flowSet in flowCore)

Pros:

Column consistency is managed in a centralized data structure, so no need to check when combining results from get_pheno_data for each CytoFrame.
Protocol buffers should make it relatively easy to add annotations with variable length and mixed type
pdata DataSet in the H5 files for each CytoFrame could be removed entirely

Cons:

Possibly significantly more code updating to be done in cytolib::GatingSet to handle the new pData data member, with particular caution needing to be paid to consistently slicing/subsetting related data structures
Requires updating the protocol buffer interface
Archiving pData of even an un-gated set of samples requires the pb file as well.

We can discuss run-time efficiency as well, but the crux of the issue may be where the metadata should live. In some ways it's nice to have it attached to the H5 for each sample's cytoframe, but maybe it doesn't make sense to store group-level information (like factor levels) with individual samples.

gfinak commented 4 years ago

I favor the first approach.

DillonHammill commented 4 years ago

@jacobpwagner, just letting you know that I have a workaround for ordering things now. This is no longer essential. Users can just specify the ordering in a list when plotting etc and set factor levels on the exported stats prior to plotting with ggplot2.

It may actually be beneficial to not change this as then we don't have to worry about differing column classes.

jacobpwagner commented 4 years ago

Thanks for the update @DillonHammill. This was still on my list, but it's good to know it's not urgent for your use case.

DillonHammill commented 4 years ago

Thanks for using the NEWS.md to track changes, this is very useful.

osthomas commented 4 years ago

Thanks for looking into this, as @jacobpwagner mentioned I primarily use this for summaries and plots.

[ ... ]

I also use pData to to store experimental information which is exported with calculated statistics. Setting factor levels early on in the pipeline means that it can be used to order/group samples for plotting within CytoExploreR but also be retained in exported statistics for plotting with ggplot.

Hello,

I'd just like to chime in and add support to this use case. I played around with CytoExploreR (thanks @DillonHammill by the way for your development work on this - of course, this extends to the whole cyto suite) and had to update to the development version of flowWorkspace for that purpose.

I really like to use ggcyto in conjunction with facet_grid to get a quick overview of my samples, and clarity suffers quite a bit from not being able to specify order arbitrarily. And while it is possible to add the desired ordering further downstream in the analysis pipeline (after extracting stats), it is tedious to do so (for me at least, because I prefer to take care of such things when reading in data and setting up the environment for analysis).

jacobpwagner commented 4 years ago

Just updating with a note here that it would also be helpful to store other sorts of type information attached to the pheno data stored at the cytolib level. Due to everything being stored as a string and returning to R as a character type, information about which variables should be true numerical types (integer, floating-point) is lost. A few problems can arise from this. For example, columns attached as integers via pData<- will return to R using pData as characters, which will change their sort order:

> nums <- 1:20
> charnums <- as.character(nums)
> sort(nums)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> sort(charnums)
 [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2"  "20" "3"  "4"  "5"  "6"  "7"  "8"  "9"

Also, we can not assume all columns that are capable of conversion to numerical types when loaded back in to R should in fact be converted. For example, subject IDs often have formatted digit strings that are not necessarily supposed to be converted in to simple numeric types. If the character id stored is "001", type conversion would succeed but make it just 1 (integer). Similarly, subject IDs with decimals like "1.24" implying arm 1 subject 24 would be converted to doubles.

So, point being, we ultimately will likely need type information attached to the pheno data stored at the C++ level in cytolib.

herbstk commented 4 years ago

I arrived here because I had the issue that pData(cytoset)<- assignments mess up the column order and loose the column types of the provided data.frame, I guess because of the translation into C structures. After reading this thread and #294 I believe I got the point of the transition from flowSet to cytoset except for the handling of metadata.

To my understanding phenoData/pData were designed to hold arbitrary metadata (as I read in the Biobase intro vignette) and the normalization by pData(cytoset)<- to strings goes against that. I was previously using ncdfFlowSet of which I liked the separation of lightweight metadata as R structures from heavyweight cytodata as on-disk structures. I guess this means that people will then start to keep cytometry data and sample metadata as separate objects and having a pData slot becomes kind of obsolete.

Maybe you could improve the cytoset help page for the pData function to make it clearer for the users what to expect.

rebeccaongmtu commented 6 months ago

This situation where the pData of the gatingset is forced to characters makes it challenging to integrate with typical ggplot functionality. Usually, I rely on factor levels to order facets in my plots, but I'm unable to do that with the gatingset data.

I'm trying to combine ggcyto with ggridges to obtain stacked density plots that are ordered based on sampling time. In my flowset, my variables are set up in the pData as factors but because the pData for the gatingset forces all the variables to characters, the stacked plots end up ordered illogically. I also cannot reorder the facets - logically the 50:50 should be between the live and dead cell populations. Do any of you know of another approach to reorder that doesn't rely on pData factor levels?

ggcyto(gs_sc,aes(x=FL1.A, fill = sampling_time), subset = "singlets") + theme_bw() + geom_density_ridges(aes(y = sampling_time), alpha = 0.5) + facet_grid(treatment~.)

RGLab / flowWorkspace

Setting pData factor levels #297