Fix base plot "checking for variable name"

mnlang commented 2 years ago

Alex (@alexpghayes),

this contains a minor PR for fixing the warnings in the base plot generic encountered in PR #74 by @zeileis. The plot function currently checks if the variable name is a single upper case letter.

This only works for existing distribution objects:

R> N <- Normal(1)
R> plot(N)
R> plot(Normal(1))
Warning message:
In if (nchar(variable_name) == 1 & toupper(variable_name) == variable_name) { :
  the condition has length > 1 and only the first element will be used

More generally, would you like me to write an issue or even a PR to unify the plot functions? Currently there is a plot.distributions3() with the boolean argument cdf, as well as the functions plot_pdf() and plot_cdf() based on ggplot2:

In a first step, it would be reasonable to unify plot() and autoplot() with the arguments type = c("pdf", "cdf")).
In a further step it might be even possible to provide an interface to the package ggdist?

zeileis commented 2 years ago

Thanks, Moritz. (As I had already indicated in the Poisson PR I was about to provide a fix for that in a separate PR.)

The problem is that substitute(x) is not enough as this provides a "call" here. The usual fix would be to use deparse(substitute(x)). In addition the && can be added as you proposed. Finally, letting the user choose a variable_name explicitly would also be helpful in some cases.

Regarding the type = c("pdf", "cdf") argument you suggest in the comment above: I would recommend to use a different argument name for that. It would be nice to export the type = NULL argument and then switch to type = "p" etc. as appropriate - but then the user can also set, for example, type = "o" or type = "h". This is what I would have liked to do in the poisson vignette.

zeileis commented 2 years ago

Separate but related question (possibly better for a separate issue): The way the plot() method determines whether a distribution is discrete or not is quite ugly and not extensible. Should this be modularized in some way? There is already the support() extractor but it only returns the minimum and maximum for a given distribution. Maybe adding some sort of extractor for the "minimum increment" would be helpful? This would be 0 for continuous distributions and 1 for most discrete distributions.

alexpghayes commented 2 years ago

More generally, would you like me to write an issue or even a PR to unify the plot functions? Currently there is a plot.distributions3() with the boolean argument cdf, as well as the functions plot_pdf() and plot_cdf() based on ggplot2:

Ah yes we do seem to have a lot of plotting functions floating around. Instead of a single plot generic with type argument, I do prefer plot_pdf() and plot_cdf() from a naming perspective, and then to either get rid of plot() and autoplot() or to alias them to them to plot_pdf(). Thoughts?

In a further step it might be even possible to provide an interface to the package ggdist?

I'm intrigued, what are you thinking here?

The way the plot() method determines whether a distribution is discrete or not is quite ugly and not extensible.

I'm open to changing this; basically the current way we represent a distribution is a quickly thrown together hack. I started thinking about a more careful class system a while back but didn't really have the capacity to push it anywhere. As a start it seems like subclasses for discrete and for continuous distributions would be an easy win, and I would welcome a PR to make those changes.

mnlang commented 2 years ago

The problem is that substitute(x) is not enough as this provides a "call" here. The usual fix would be to use deparse(substitute(x)). In addition the && can be added as you proposed. Finally, letting the user choose a variable_name explicitly would also be helpful in some cases.

My bad. I have adjusted the code accordingly!

mnlang commented 2 years ago

Ah yes we do seem to have a lot of plotting functions floating around. Instead of a single plot generic with type argument, I do prefer plot_pdf() and plot_cdf() from a naming perspective, and then to either get rid of plot() and autoplot() or to alias them to them to plot_pdf(). Thoughts?

I understand your point, on the other hand plot() and autoplot() distinguish more clearly between base and ggplot2 graphics. Aliasing might really be a good compromise by first checking if ggplot2 is loaded and rendering the plot accordingly?

But thinking about it again, wouldn't that imply that the arguments are 1:1 interchangeable - which will not be really the case. Hence, I think we should rather work with plot and autoplot and put parts of the autoplot function into individual exported geoms (as it's already partly done).

What do you think, @alexpghayes and @zeileis?

rmtrane commented 2 years ago

Ah yes we do seem to have a lot of plotting functions floating around. Instead of a single plot generic with type argument, I do prefer plot_pdf() and plot_cdf() from a naming perspective, and then to either get rid of plot() and autoplot() or to alias them to them to plot_pdf(). Thoughts?

I understand your point, on the other hand plot() and autoplot() distinguish more clearly between base and ggplot2 graphics. Aliasing might really be a good compromise by first checking if ggplot2 is loaded and rendering the plot accordingly?

But thinking about it again, wouldn't that imply that the arguments are 1:1 interchangeable - which will not be really the case. Hence, I think we should rather work with plot and autoplot and put parts of the autoplot function into individual exported geoms (as it's already partly done).

What do you think, @alexpghayes and @zeileis?

When I first wrote the plot_cdf and plot_pdf functions, I simply needed something quick and reusable for plotting CDFs and PDFs for an intro stats class I taught. Coincidentally, @paulnorthrop wrote the plot and autoplot function at about the same time, and we decided it was easier for the time being to simply include both. It was mainly due to a lack of time to find out what would be the best approach.

My original idea was to create geom_pdf and geom_cdf to make it easy to both plot PDFs and CDFs, but also to overlay distributions onto histograms. My goal originally was to get to a point where I could do something like

# A normal curve
ggplot() + geom_pdf(d = Normal()) 

# Normal curve on top of histogram. Might be super tricky cause you need to be careful about
# bin sizes and normalize, but never got to a point where that was the issue...
ggplot(data = students, aes(x = height)) + geom_histogram(...) + geom_pdf(d = Normal(..., ...))

Unfortunately, I didn't have time to dive deep enough into how to write geom_* and stat_* functions -- as you can tell from my code, I tried, but didn't quite get there. I still think this would be the most elegant solution, but in my (limited) experience, extending ggplot2 is hard...

As @alexpghayes, I would prefer to keep the plot_cdf() and plot_pdf(). Personally, I wouldn't mind making ggplot2 a dependency for this and getting rid of plot and autoplot, but I understand if you want to keep a base option around.

Anyway, just wanted to chime in. If anyone wants to work on an implementation that functions as a ggplot2 extension, I think that would be awesome, and I'd happily help.

zeileis commented 2 years ago

Thanks everybody for the discussion so far. I will try to structure my feedback on the different aspects a bit.

Discrete vs. continuous

As the nature of the variable is already implied by the class name itself, I wouldn't do anything else to store that information in the object itself, but put it into a dedicated method like is_discrete() or minincrement() or something along those lines.
My feeling is that separate classes for this would be overkill. I would typically use a different class if the structure of the object changes, e.g., some list elements don't exist or have different names, etc. But this is not the case here.
Finally, I think it will be hard to come up with a structure that generally covers all possible forms of support that are conceivable (this could be very complex). But most of the cases we need in practice are either continuous in an interval or discrete with steps of 1. So I would aim at making the computations for these important special cases easy. Everything else could be handled by writing a dedicated plot.class method that might call the general plot.distribution function internally after setting up some variables (like ranges, breaks, etc.).

Functions vs. methods

I don't like the split into plot_pdf() and plot_cdf() because this is not very "DRY" (don't repeat yourself). I think there is more what the PDF and CDF visualizations have in common than what differs, so I would keep them in a single function.
I'm not sure if the same reasoning would apply to the base graphics vs. ggplot2 visualizations. I think many aspects between these are quite different so my feeling is that they could be kept separate.
I agree with Moritz that plot() and autoplot() are nice and intuitively distinguish between the plotting systems. (And even if plot_*df() functions are kept around I would recommend building the methods on top of them, just for convenience.)

ggplot2 dependency

Given that visualization is not the main focus of the package I think it is worth trying not to have ggplot2 as a strong dependency but just as a Suggests for example. In other packages I have done so by registering the autoplot() methods only conditionally on ggplot2 being loaded, adding a stopifnot(requireNamespace("ggplot2")) in the function, and only using fully-qualified ggplot:: function names internally.
But Moritz might have thought about this some more than I have and has also looked into other infrastructure for geoms etc.

alexpghayes commented 2 years ago

Merging the fix in for now, let's continue discussion about plotting code refactor at #77.

alexpghayes / distributions3

Fix base plot "checking for variable name" #75