alexpghayes / distributions3

Probability Distributions as S3 Objects
https://alexpghayes.github.io/distributions3/
Other
100 stars 16 forks source link

Fix base plot "checking for variable name" #75

Closed mnlang closed 2 years ago

mnlang commented 2 years ago

Alex (@alexpghayes),

this contains a minor PR for fixing the warnings in the base plot generic encountered in PR #74 by @zeileis. The plot function currently checks if the variable name is a single upper case letter.

This only works for existing distribution objects:

R> N <- Normal(1)
R> plot(N)
R> plot(Normal(1))
Warning message:
In if (nchar(variable_name) == 1 & toupper(variable_name) == variable_name) { :
  the condition has length > 1 and only the first element will be used

More generally, would you like me to write an issue or even a PR to unify the plot functions? Currently there is a plot.distributions3() with the boolean argument cdf, as well as the functions plot_pdf() and plot_cdf() based on ggplot2:

zeileis commented 2 years ago

Thanks, Moritz. (As I had already indicated in the Poisson PR I was about to provide a fix for that in a separate PR.)

The problem is that substitute(x) is not enough as this provides a "call" here. The usual fix would be to use deparse(substitute(x)). In addition the && can be added as you proposed. Finally, letting the user choose a variable_name explicitly would also be helpful in some cases.

Regarding the type = c("pdf", "cdf") argument you suggest in the comment above: I would recommend to use a different argument name for that. It would be nice to export the type = NULL argument and then switch to type = "p" etc. as appropriate - but then the user can also set, for example, type = "o" or type = "h". This is what I would have liked to do in the poisson vignette.

zeileis commented 2 years ago

Separate but related question (possibly better for a separate issue): The way the plot() method determines whether a distribution is discrete or not is quite ugly and not extensible. Should this be modularized in some way? There is already the support() extractor but it only returns the minimum and maximum for a given distribution. Maybe adding some sort of extractor for the "minimum increment" would be helpful? This would be 0 for continuous distributions and 1 for most discrete distributions.

alexpghayes commented 2 years ago

More generally, would you like me to write an issue or even a PR to unify the plot functions? Currently there is a plot.distributions3() with the boolean argument cdf, as well as the functions plot_pdf() and plot_cdf() based on ggplot2:

Ah yes we do seem to have a lot of plotting functions floating around. Instead of a single plot generic with type argument, I do prefer plot_pdf() and plot_cdf() from a naming perspective, and then to either get rid of plot() and autoplot() or to alias them to them to plot_pdf(). Thoughts?

In a further step it might be even possible to provide an interface to the package ggdist?

I'm intrigued, what are you thinking here?

The way the plot() method determines whether a distribution is discrete or not is quite ugly and not extensible.

I'm open to changing this; basically the current way we represent a distribution is a quickly thrown together hack. I started thinking about a more careful class system a while back but didn't really have the capacity to push it anywhere. As a start it seems like subclasses for discrete and for continuous distributions would be an easy win, and I would welcome a PR to make those changes.

mnlang commented 2 years ago

The problem is that substitute(x) is not enough as this provides a "call" here. The usual fix would be to use deparse(substitute(x)). In addition the && can be added as you proposed. Finally, letting the user choose a variable_name explicitly would also be helpful in some cases.

My bad. I have adjusted the code accordingly!

mnlang commented 2 years ago

Ah yes we do seem to have a lot of plotting functions floating around. Instead of a single plot generic with type argument, I do prefer plot_pdf() and plot_cdf() from a naming perspective, and then to either get rid of plot() and autoplot() or to alias them to them to plot_pdf(). Thoughts?

I understand your point, on the other hand plot() and autoplot() distinguish more clearly between base and ggplot2 graphics. Aliasing might really be a good compromise by first checking if ggplot2 is loaded and rendering the plot accordingly?

But thinking about it again, wouldn't that imply that the arguments are 1:1 interchangeable - which will not be really the case. Hence, I think we should rather work with plot and autoplot and put parts of the autoplot function into individual exported geoms (as it's already partly done).

What do you think, @alexpghayes and @zeileis?

rmtrane commented 2 years ago

Ah yes we do seem to have a lot of plotting functions floating around. Instead of a single plot generic with type argument, I do prefer plot_pdf() and plot_cdf() from a naming perspective, and then to either get rid of plot() and autoplot() or to alias them to them to plot_pdf(). Thoughts?

I understand your point, on the other hand plot() and autoplot() distinguish more clearly between base and ggplot2 graphics. Aliasing might really be a good compromise by first checking if ggplot2 is loaded and rendering the plot accordingly?

But thinking about it again, wouldn't that imply that the arguments are 1:1 interchangeable - which will not be really the case. Hence, I think we should rather work with plot and autoplot and put parts of the autoplot function into individual exported geoms (as it's already partly done).

What do you think, @alexpghayes and @zeileis?

When I first wrote the plot_cdf and plot_pdf functions, I simply needed something quick and reusable for plotting CDFs and PDFs for an intro stats class I taught. Coincidentally, @paulnorthrop wrote the plot and autoplot function at about the same time, and we decided it was easier for the time being to simply include both. It was mainly due to a lack of time to find out what would be the best approach.

My original idea was to create geom_pdf and geom_cdf to make it easy to both plot PDFs and CDFs, but also to overlay distributions onto histograms. My goal originally was to get to a point where I could do something like

# A normal curve
ggplot() + geom_pdf(d = Normal()) 

# Normal curve on top of histogram. Might be super tricky cause you need to be careful about
# bin sizes and normalize, but never got to a point where that was the issue...
ggplot(data = students, aes(x = height)) + geom_histogram(...) + geom_pdf(d = Normal(..., ...))

Unfortunately, I didn't have time to dive deep enough into how to write geom_* and stat_* functions -- as you can tell from my code, I tried, but didn't quite get there. I still think this would be the most elegant solution, but in my (limited) experience, extending ggplot2 is hard...

As @alexpghayes, I would prefer to keep the plot_cdf() and plot_pdf(). Personally, I wouldn't mind making ggplot2 a dependency for this and getting rid of plot and autoplot, but I understand if you want to keep a base option around.

Anyway, just wanted to chime in. If anyone wants to work on an implementation that functions as a ggplot2 extension, I think that would be awesome, and I'd happily help.

zeileis commented 2 years ago

Thanks everybody for the discussion so far. I will try to structure my feedback on the different aspects a bit.

Discrete vs. continuous

Functions vs. methods

ggplot2 dependency

alexpghayes commented 2 years ago

Merging the fix in for now, let's continue discussion about plotting code refactor at #77.