Allow protection of observations in 'filter' statistics

aphalo / ggpp

Grammar of graphics extensions to 'ggplot2'

Issue #19, closed by aphalo 1 year ago

aphalo commented 1 year ago

In statistics that filter part of the data based on density, add a parameter protect and implement a way of ensuring that some labels are never dropped.

mschubert commented 1 year ago

Thanks for taking the initiative here, I think the idea would be a good match for ggpp!

I'm wondering, given that you already have your family of keep.* parameters, would it be easier to understand if instead of protect you could add e.g. a keep.function argument?

stat_dens2d_filter(
  keep.fraction = 0.1,
  keep.number = Inf,
  keep.sparse = TRUE,
)

If this function could then take the data (or each data column) as argument including the computed ..density.. parameter (to stick with ggplot2-style computed parameter naming), a user could easily assemble their own filter logic.

This would enable a bit broader applicability compared to just taking the label vector.
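A minimal sketch, in plain R, of what such a user-supplied filter might look like. The name `keep_function` and the `density` column are hypothetical and used only for illustration; nothing here is part of the ggpp API.

```r
# Hypothetical data as a stat might see it: coordinates, labels and a
# per-observation density estimate (values made up for illustration).
d <- data.frame(
  x = c(-1.2, 0.1, 0.2, 1.5),
  y = c(0.8, 0.0, -0.1, -1.1),
  label = c("aA", "bB", "cC", "dD"),
  density = c(0.02, 0.30, 0.28, 0.01),
  stringsAsFactors = FALSE
)

# User-defined logic (hypothetical): keep observations in sparse regions,
# plus one label that must never be dropped.
keep_function <- function(data) {
  data$density < 0.05 | data$label == "bB"
}

kept <- d[keep_function(d), ]
kept$label
```

Because the function receives the whole data frame, the selection can combine the density with any other column, which is the broader applicability referred to above.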

aphalo commented 1 year ago

@mschubert Since the parameter, as I have implemented it, also accepts vectors, I settled on keep.these. Thanks for your suggestion! protect was not a good choice...

If you have time, please, install the package from GitHub and test if this version solves the issue.

I am not convinced that such a flexible function is needed, as one can use after_stat(), stage() etc. in aes(). In addition, it would overlap in functionality with keep.fraction and keep.number, creating confusion. Anyway, could you provide an example of how you would use the proposed function?

I think, especially for genomics, it could be useful to be able to restrict labels to specific plot quadrants, and also to be able to set the arguments to the keep.fraction and keep.number parameters separately for each quadrant. This is doable, but will need to wait until I have more free time, as it will most likely take a whole day to implement and test.

aphalo commented 1 year ago

I am adding some examples below and closing this issue as resolved. I opened issue #22 for the support of quadrants mentioned above.

library(ggplot2)
library(ggpp)
#> 
#> Attaching package: 'ggpp'
#> The following object is masked from 'package:ggplot2':
#> 
#>     annotate
library(ggrepel)
syms = c(letters[1:5], LETTERS[1:5], 0:9)
labs = do.call(paste0, expand.grid(syms, syms))
dset = data.frame(x=rnorm(1e3), y=rnorm(1e3), label=sample(labs, 1e3, replace=TRUE))
ggplot(dset, aes(x=x, y=y, label = label)) +
  geom_point(colour = "grey85") +
  stat_dens2d_filter(geom = "text_repel",
                     position = position_nudge_centre(x = 0.1, 
                                                      y = 0.1, 
                                                      direction = "radial"),
                     keep.number = 50,
                     keep.these = function(x) {x %in% c("aA", "bB", "cC")},
                     min.segment.length = 0) +
  theme_bw()


ggplot(dset, aes(x=x, y=y, label = label)) +
  geom_point(colour = "grey85") +
  stat_dens2d_filter(geom = "text_repel",
                     position = position_nudge_centre(x = 0.1, 
                                                      y = 0.1, 
                                                      direction = "radial"),
                     keep.number = 50,
                     keep.these = c("aA", "bB", "cC"),
                     min.segment.length = 0) +
  theme_bw()


library(magrittr)

ggplot(dset, aes(x=x, y=y, label = label)) +
  geom_point(colour = "grey85") +
  stat_dens2d_filter(data = . %>% subset(y >= 0),
                     geom = "text_repel",
                     position = position_nudge_centre(x = 0.1, 
                                                      y = 0.1, 
                                                      center_y = 0, 
                                                      direction = "radial"),
                     keep.number = 25,
                     keep.these = c("aA", "bB", "cC"),
                     min.segment.length = 0) +
  theme_bw()

Created on 2023-01-20 with reprex v2.0.2
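The first two examples pass the same set of labels to keep.these, once as a function and once as a character vector. A minimal sketch of how both forms could be reduced to one logical vector is shown below. The helper `resolve_keep()` is hypothetical and is not the code used inside ggpp.

```r
# Hypothetical helper: accept either a function or a vector of labels and
# return a logical vector marking the observations to protect.
resolve_keep <- function(keep.these, labels) {
  if (is.function(keep.these)) {
    keep <- keep.these(labels)      # function is expected to return logicals
  } else {
    keep <- labels %in% keep.these  # vector of labels to protect
  }
  as.logical(keep)
}

labels <- c("aA", "xY", "bB", "q7")
by_vector   <- resolve_keep(c("aA", "bB", "cC"), labels)
by_function <- resolve_keep(function(x) x %in% c("aA", "bB", "cC"), labels)
identical(by_vector, by_function)
```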

aphalo commented 1 year ago

@mschubert

> If this function could then take the data (or each data column) as argument including the computed ..density.. parameter (to stick with ggplot2-style computed parameter naming), a user could easily assemble their own filter logic.
>
> This would enable a bit broader applicability compared to just taking the label vector.

The possibility of having the density estimate at the coordinates of each observation added to the data returned by the stats is now implemented in the "quadrant.filters" development branch.

Could you provide an example of when the selection logic would need to be in a user-defined function? I would need some examples of use cases to be able to consider this feature.

Although you did not suggest this directly, your question made me realize that allowing keep.fraction and keep.number to be applied separately to each quadrant or half of a plot, as well as supporting different values for each of these parameters in each quadrant, could be very useful. (The origin of the quadrants can be set by the user.) This is now also implemented.

Thanks for the feedback!

mschubert commented 1 year ago

Sorry for the delayed answer @aphalo:

I've often got points of different categories, where one is more important to label than the other. So I want to label category A first, but if there is space also label category B, and so on.

A real example would be a T-cell and + cells with presented MHC peptide co-culture, where we want to quantify how much each of the peptides is enriched or depleted. The results look like this:

Here, filled red circles are mutated peptides, empty red circles are reference peptides, and the control peptides are in blue. Bigger dots are significant changes; small dots are not significant (n.s.).

Right now, I am using stat_dens2d_filter_g to select by color, which works well in most cases. But if there are too many points I still want to prioritize full red > empty red > blue (> small dots), while enforcing a density limit to keep the labels readable. This requires a custom prioritization that no tool will likely support out of the box.
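As a rough illustration of this prioritization in plain R: apply a density limit first, then rank the remaining points by category priority and, within a category, by density. The column names, the priority values, `max_density` and `max_labels` are all made up for this sketch and do not correspond to any ggpp parameter.

```r
# Made-up observations with a category and a local density estimate.
pts <- data.frame(
  label = paste0("p", 1:6),
  category = c("blue", "full_red", "empty_red", "full_red", "blue", "empty_red"),
  density = c(0.01, 0.20, 0.05, 0.03, 0.02, 0.04),
  stringsAsFactors = FALSE
)

priority <- c(full_red = 1, empty_red = 2, blue = 3)  # lower = more important
max_density <- 0.10  # density limit that keeps labels readable
max_labels <- 3      # total number of labels to draw

# Drop points in crowded regions, then order by priority and density.
eligible <- pts[pts$density <= max_density, ]
ord <- order(priority[eligible$category], eligible$density)
selected <- eligible$label[head(ord, max_labels)]
selected
```

Note that p2 is dropped even though it belongs to the highest-priority category, because it lies in a region above the density limit.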

As for the quadrants, I'm sure this would be useful! At the same time, I'm wondering if a more general approach would cover more cases.

For instance, if instead of a per_quadrant option I had a group_by parameter that I could set to paste(sign(x), sign(y)), this would be more general without much more complexity (in the user interface).
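A small sketch of the grouping idea in plain R, assuming a precomputed `density` column (hypothetical) and keeping only the sparsest observation in each group. group_by is the parameter proposed above, not an existing one.

```r
# Made-up data with a local density estimate per observation.
d <- data.frame(
  x = c(-2, -1, 1, 2, 1.5, -0.5),
  y = c( 1,  2, 1, 3, -1,  -2),
  density = c(0.10, 0.02, 0.05, 0.01, 0.03, 0.04)
)

# Grouping key as suggested: one group per quadrant around the origin.
quadrant <- paste(sign(d$x), sign(d$y))

# Rank observations by density within each group and keep the sparsest one.
rank_in_group <- ave(d$density, quadrant,
                     FUN = function(z) rank(z, ties.method = "first"))
keep <- rank_in_group <= 1
d[keep, ]
```

With this key, points lying exactly on an axis form their own groups, because sign() returns 0 for them.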

But that's just an idea, I don't have a strong opinion on that.

aphalo commented 1 year ago

@mschubert Hi. OK, I now understand your use case. I need to think about how it could be implemented. I am reopening this issue so that I do not forget that this is on my to-do list.

aphalo commented 1 year ago

@mschubert Hi. I edited just now only stat_dens2d_labels(), adding parameter keep.these.target to select which column(s) in data are passed as first argument to the function passed as argument to keep.these. If a single column is selected then a vector is passed, otherwise a data frame. This new code is not yet tested, except that with defaults it does not change previous behaviour.

The changes to the code are small, and I do not expect performance to suffer much, so I will implement it also in the related functions. Hopefully, this adds flexibility that is useful to you.

Prioritizing by groups could be more effectively done, I think, using a variation on geom_text(), based on the approach of using check_overlap. I am not sure what is already possible by playing with the row order in data together with check_overlap.

aphalo commented 1 year ago

@mschubert Hi. I edited stat_dens2d_labels(), stat_dens2d_filter() and stat_dens2d_filter_g(), adding parameter keep.these.target to select which column(s) in data are passed as first argument to the function passed as argument to keep.these. If a single column is selected then a vector is passed, otherwise a data frame. This new code is not yet tested, except that with defaults it does not change previous behaviour.
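The dispatch described here (a vector for a single column, a data frame otherwise) can be sketched in plain R as follows. `apply_keep()` and its `target` argument are illustrative stand-ins, not the package's internal code.

```r
# Hypothetical helper mirroring the described behaviour: one target column
# is passed on as a vector, several target columns as a data frame.
apply_keep <- function(data, keep_fun, target = "label") {
  arg <- if (length(target) == 1L) {
    data[[target]]
  } else {
    data[, target, drop = FALSE]
  }
  as.logical(keep_fun(arg))
}

d <- data.frame(
  label = c("aA", "bB", "cC"),
  x = c(-1, 2, 3),
  stringsAsFactors = FALSE
)

# Single column: the function receives a character vector.
one_col <- apply_keep(d, function(l) l %in% "aA")

# Two columns: the function receives a data frame.
two_col <- apply_keep(d, function(df) df$x > 0 & df$label != "cC",
                      target = c("x", "label"))
```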

The changes to the code are small, and I do not expect performance to suffer much, so I will implement it also in the related 1d functions. Hopefully, this adds flexibility that is useful to you and others.

Here are some thoughts about possible ways of setting priorities for plotting, using geometries. Prioritizing by groups could be done, I think, using geom_text() and its check_overlap parameter. I think this should work, but at least with ggpp::geom_text_s(), if segments are plotted, they will always be plotted. check_overlap is implemented in 'grid' and only for textGrob. If this is enough, prioritizing would require manipulating the order of rows in data. I do not know if check_overlap works across plot layers; if it does, then you could plot one group per layer. I cannot think of a way of setting priorities in a statistic, as overlaps take place when the geom renders the plot as grid grobs. This approach might also work with geom_text_repel() and its max.overlaps parameter.
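A minimal sketch of the row-order approach, untested against the real use case. The ggplot2 documentation describes check_overlap as dropping text that overlaps previous text in the same layer, evaluated at draw time in the order of the data, so sorting rows by priority should let the most important labels win. The `priority` column and the data are made up.

```r
# Made-up points; the first two are close enough that labels may overlap.
pts <- data.frame(
  x = c(1, 1.01, 2),
  y = c(1, 1, 2),
  label = c("control", "mutated", "reference"),
  priority = c(3, 1, 2),  # lower = more important (hypothetical column)
  stringsAsFactors = FALSE
)

# Put high-priority rows first so their labels are drawn first.
pts <- pts[order(pts$priority), ]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(pts, ggplot2::aes(x = x, y = y, label = label)) +
    ggplot2::geom_point(colour = "grey85") +
    ggplot2::geom_text(check_overlap = TRUE)  # later overlapping text is skipped
}
```

Whether a given label is actually dropped depends on the rendered text size and the plot dimensions, so the result should be checked visually.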

aphalo commented 1 year ago

@mschubert I added exclude.these to complement keep.these, and renamed keep.these.target to these.target.

I think this is as much flexibility as can be added to these six statistics. The algorithm is not based on choosing among closely located observations: where the density is high, none is selected. This approach keeps the code simple and agnostic about the order of rows in data.
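To make the interplay concrete, here is a simplified sketch in plain R: observations in the sparsest regions are selected by density alone, and the keep and exclude sets are then applied as overrides. The function `filter_sparse()`, the precomputed densities and the rule that exclusion wins over protection are assumptions made for this sketch, not a description of the ggpp implementation.

```r
# Simplified sketch: select the sparsest fraction, then apply overrides.
filter_sparse <- function(density, labels, keep.fraction = 0.5,
                          keep.these = character(),
                          exclude.these = character()) {
  n_keep <- floor(length(density) * keep.fraction)
  # Selection depends only on density, not on the order of rows.
  selected <- rank(density, ties.method = "first") <= n_keep
  selected[labels %in% keep.these] <- TRUE      # protected labels
  selected[labels %in% exclude.these] <- FALSE  # assumed: exclusion wins
  selected
}

density <- c(0.30, 0.02, 0.25, 0.01, 0.05, 0.40)  # made-up estimates
labels <- c("a", "b", "c", "d", "e", "f")

sel <- filter_sparse(density, labels, keep.fraction = 0.5,
                     keep.these = "f", exclude.these = "b")
labels[sel]
```

Here "f" is kept despite sitting in the densest region, and "b" is dropped despite being in a sparse one.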

I close this issue, but feel free to reopen it, or raise a new issue to discuss handling the priorities using a geometry. In the future I may play with the approach I described in the previous comment to at least check if it is viable. If you try it, please, let me know if it works or not.

Many thanks for sharing these ideas and use cases! The stats are now much more useful than they were, and I will surely use some of these enhancements myself in the future.