Open DillonHammill opened 4 years ago
Because each column of pData
is now simply treated/stored as string
in H5 without preserving their extra attributes or types. We'll need to discuss whether we want to redesign the pData
structure in h5
in order to keep these info.
I certainly use it a lot to order samples by experiment variables, it would be great if we could keep this capability.
Do you have to use factor?Ordering doesn't necessarily require that.
How would you do it? I need the ordering to be retained by the GatingSet and it should not affect the sample order.
I don't fully understand your intention. Please give a concrete use case showing that character isn't sufficient and factor is necessary
@DillonHammill Default factors are not ordered. Do you mean that you use levels
which returns the factors in alphabetical order? If that's the case, the equivalent would be sort(unique(x))
, though if there are factor levels missing, this would not capture that.
Well, factor levels always have a default ordering, usually dictated by the values as they appear in the data.frame
, right? So, I think maybe @DillonHammill wants to be able to arbitrarily change that ordering and have that be persistent, maybe for the sake of summaries or plots later.
No, they are alphabetical:
> factor(c("B","C","A"))
[1] B C A
Levels: A B C
Oh, right, by default. For some reason I was thinking I ran in to an exception to that, but it must have been manually reordered. But anyway, if for some reason he has a method that arbitrarily reorders the levels then that would be a problem if it doesn't persist.
Basically, even if a factor is not ordered (as in the the ordered flag is not set to true and the factor levels should not be considered ordinal), default group ordering information for tables/plots/summaries can still be preserved (which may be related to the use case here). That is levels
returns the underlying levels vector order, not re-alphabetizing on the fly:
> blah <- iris
> levels(blah$Species)
[1] "setosa" "versicolor" "virginica"
> is.ordered(blah$Species)
[1] FALSE
> table(blah$Species)
setosa versicolor virginica
50 50 50
> levels(blah$Species) <- c("versicolor", "virginica", "setosa")
> levels(blah$Species)
[1] "versicolor" "virginica" "setosa"
> is.ordered(blah$Species)
[1] FALSE
> table(blah$Species)
versicolor virginica setosa
50 50 50
> blah$Species <- factor(blah$Species, levels = c("virginica", "versicolor", "setosa"))
> levels(blah$Species)
[1] "virginica" "versicolor" "setosa"
> is.ordered(blah$Species)
[1] FALSE
> table(blah$Species)
virginica versicolor setosa
50 50 50
But in order to preserve that behavior in the h5
we would have to add an attribute to retain the order.
Thanks for looking into this, as @jacobpwagner mentioned I primarily use this for summaries and plots.
For example, in my plotting function cyto_plot
users can supply a GatingSet/cytoset
and request that samples be grouped or ordered by experiment variables when plotting. Users may for example want to plot the unstimulated samples in the first panel and stimulated samples in the second (this is hardly ever in alphabetical order). It much easier for users if this is taken care of behind the scenes so that they don't have to manually supply the indices to order the samples for plotting (we are normally dealing with hundreds of samples). Here is an example where samples are ordered by two variables (stimulus and antigen concentration):
cyto_plot(gs,
parent = "CD4 T Cells",
alias = "",
channels = c("CD44", "CD69"),
group_by = c("Treatment", "OVAConc")) # pData variables
I also use pData
to to store experimental information which is exported with calculated statistics. Setting factor levels early on in the pipeline means that it can be used to order/group samples for plotting within CytoExploreR but also be retained in exported statistics for plotting with ggplot.
That's kind of what I suspected. @mikejiang , maybe we could add representation of the levels
vectors to the h5
for the columns to match R-level factor columns. We could then check those levels
vectors in the process of gs_cyto_data
to mimic the R-level levels
method:
levels.default <- function(x) attr(x, "levels")
of course, this starts to look a bit like we're re-inventing the wheel of data.frame
.
Just documenting some of our offline discussions:
2 possible approaches:
1. The distributed approach:
Expand the pData
representation within each CytoFrame
to handle more than simple string-string key-value mapping, thus allowing factor levels to be carried as well. That is, instead of this:
https://github.com/RGLab/cytolib/blob/d0f70a2549e035f81169301b531554907f8b9c1f/inst/include/cytolib/CytoFrame.hpp#L29
typedef unordered_map<string, string> PDATA;
do something like this:
enum class pDataType {pd_int, pd_dbl, pd_str, pd_bool};
struct PDATA_VAL{
pDataType type;
string value; // can be cast to appropriate return type
set<string> levels; // only non-empty for factors
};
typedef unordered_map<string, PDATA_VAL> PDATA;
This extra information would then also have to be written out to the pdata
H5 DataSet
Pros:
cytolib::GatingSet
and leaves in place the status quo of distributed pData
Cons:
pdata
DataSet
in the H5 representation (more work updating H5CytoFrame
) -- this is also complicated by the fact that H5 is not really designed for jagged structures with mixed data types.2. The centralized approach:
Add a C++ level representation of the pData
data.frame
to cytolib::GatingSet
and save this out to the protocol buffer, maintaining one centralized store of group metadata (like existed in the R representation before the cytoset
merge and still exists for flowFrame
/flowSet
in flowCore
)
Pros:
get_pheno_data
for each CytoFrame
.pdata
DataSet in the H5 files for each CytoFrame
could be removed entirelyCons:
cytolib::GatingSet
to handle the new pData
data member, with particular caution needing to be paid to consistently slicing/subsetting related data structurespData
of even an un-gated set of samples requires the pb
file as well.We can discuss run-time efficiency as well, but the crux of the issue may be where the metadata should live. In some ways it's nice to have it attached to the H5 for each sample's cytoframe, but maybe it doesn't make sense to store group-level information (like factor levels) with individual samples.
I favor the first approach.
@jacobpwagner, just letting you know that I have a workaround for ordering things now. This is no longer essential. Users can just specify the ordering in a list when plotting etc and set factor levels on the exported stats prior to plotting with ggplot2.
It may actually be beneficial to not change this as then we don't have to worry about differing column classes.
Thanks for the update @DillonHammill. This was still on my list, but it's good to know it's not urgent for your use case.
Thanks for using the NEWS.md to track changes, this is very useful.
Thanks for looking into this, as @jacobpwagner mentioned I primarily use this for summaries and plots.
[ ... ]
I also use
pData
to to store experimental information which is exported with calculated statistics. Setting factor levels early on in the pipeline means that it can be used to order/group samples for plotting within CytoExploreR but also be retained in exported statistics for plotting with ggplot.
Hello,
I'd just like to chime in and add support to this use case. I played around with CytoExploreR (thanks @DillonHammill by the way for your development work on this - of course, this extends to the whole cyto suite) and had to update to the development version of flowWorkspace for that purpose.
I really like to use ggcyto in conjunction with facet_grid
to get a quick overview of my samples, and clarity suffers quite a bit from not being able to specify order arbitrarily. And while it is possible to add the desired ordering further downstream in the analysis pipeline (after extracting stats), it is tedious to do so (for me at least, because I prefer to take care of such things when reading in data and setting up the environment for analysis).
Just updating with a note here that it would also be helpful to store other sorts of type information attached to the pheno data stored at the cytolib
level. Due to everything being stored as a string and returning to R as a character type, information about which variables should be true numerical types (integer, floating-point) is lost. A few problems can arise from this. For example, columns attached as integers via pData<-
will return to R using pData
as characters, which will change their sort order:
> nums <- 1:20
> charnums <- as.character(nums)
> sort(nums)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> sort(charnums)
[1] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "3" "4" "5" "6" "7" "8" "9"
Also, we can not assume all columns that are capable of conversion to numerical types when loaded back in to R should in fact be converted. For example, subject IDs often have formatted digit strings that are not necessarily supposed to be converted in to simple numeric types. If the character id stored is "001", type conversion would succeed but make it just 1 (integer). Similarly, subject IDs with decimals like "1.24" implying arm 1 subject 24 would be converted to doubles.
So, point being, we ultimately will likely need type information attached to the pheno data stored at the C++ level in cytolib
.
I arrived here because I had the issue that pData(cytoset)<-
assignments mess up the column order and loose the column types of the provided data.frame, I guess because of the translation into C structures. After reading this thread and #294 I believe I got the point of the transition from flowSet
to cytoset
except for the handling of metadata.
To my understanding phenoData/pData
were designed to hold arbitrary metadata (as I read in the Biobase intro vignette) and the normalization by pData(cytoset)<-
to strings goes against that. I was previously using ncdfFlowSet
of which I liked the separation of lightweight metadata as R structures from heavyweight cytodata as on-disk structures. I guess this means that people will then start to keep cytometry data and sample metadata as separate objects and having a pData slot becomes kind of obsolete.
Maybe you could improve the cytoset
help page for the pData
function to make it clearer for the users what to expect.
This situation where the pData of the gatingset is forced to characters makes it challenging to integrate with typical ggplot functionality. Usually, I rely on factor levels to order facets in my plots, but I'm unable to do that with the gatingset data.
I'm trying to combine ggcyto with ggridges to obtain stacked density plots that are ordered based on sampling time. In my flowset, my variables are set up in the pData as factors but because the pData for the gatingset forces all the variables to characters, the stacked plots end up ordered illogically. I also cannot reorder the facets - logically the 50:50 should be between the live and dead cell populations. Do any of you know of another approach to reorder that doesn't rely on pData factor levels?
ggcyto(gs_sc,aes(x=FL1.A, fill = sampling_time), subset = "singlets") + theme_bw() + geom_density_ridges(aes(y = sampling_time), alpha = 0.5) + facet_grid(treatment~.)
Hi @mikejiang & @jacobpwagner ,
I am having trouble setting factor levels of pData variables using the latest flowWorkspace. I used to be able to do this:
Now the column retains the original class and is not converted to a factor.
Any idea why this is happening?