GiovineItalia / Gadfly.jl

Crafty statistical graphics for Julia.
http://gadflyjl.org/stable/

Enhancements for biologists #103

Closed john9631 closed 10 years ago

john9631 commented 10 years ago

Daniel, after Muraveill's comments on the plotting thread, I was going through the things a biologist might want to do in the heretical world. I discovered some things that are needed to make Gadfly work well for them.

  1. As well as X and Y labels you need to be able to specify the Key (color or facet) and the Title so that they can have values if dataframes aren't being used. Note: I'm wondering at the moment if it would be better to let biologists do simple charts without dataframes and then teach them enough about dfs to convince them that they make life easier if they want to build complex plots (so possibly key labels are not needed).
  2. For Geom.boxplot, if x is not specified it needs to default to the simplest case of one plot; otherwise people have to construct an X with something like:

       X = ["A" for i in 1:size(df,1)]
       plot(x=X, y=df["time2"], Geom.boxplot)

  3. It would be nice if Geom.subplot_grid could handle Geom.smooth as well as Geom.point (it didn't work well when I added it).
  4. Have a look at the outputs for the two draw functions at the bottom of the ipynb.
  5. Also that boxplot of time2 seems to have a range beyond the maximum value. Is that right?

The data I used is rubbish, but it's at http://dropcanvas.com/gyk14 along with the notebook and my .png and .pdf outputs.

john9631 commented 10 years ago

An extra one. I can't see a bar chart (bar representation of Point Chart). Biologists draw a lot of bar charts :)

They actually need a bar chart that has the characteristics of both a bar and an error bar, so you have x, y, ymin and ymax. Why? Apparently bar charts with confidence intervals are popular. Basically you extend a vertical line from ymin to ymax on top of the bar.

As an extension of that request, I wonder if this CI-type statistic could be built for other chart types as well, in the same sort of form as ggplot2's.

dcjones commented 10 years ago

This is quite the issue! Let me see if I have everything:

Lastly, I need to clean up the manual. Referring to using plot without a data frame as "heretical" was tongue-in-cheek, but people seem to think it's actually bad or wrong. I was really just making fun of myself for previously making plot too rigid.

john9631 commented 10 years ago

Thanks Daniel.

I will try x_discrete. If it's good I'll just add it to my document; otherwise I'll let you know of any issues.

I'm amused by the heretical view ... I will continue extending the Reference card. I am essentially building a complement to the manual that lets "a biologist" pick it up and build some charts. I'm a bit stuck at present because there is a point you reach where dataframes are needed (at least I think they are) and I think I have to go through a structure like:

I'm basing it partly on trying to answer the questions raised by following Hadley's Graphics Cheat Sheet http://had.co.nz/stat480/r/graphics.html

Finally a question: Can I build a chart with 4 lines on it without using a dataframe? At the moment, for a line for each of 4 different bird species say in 4 vectors, I would stack the vectors along with a repeating index and 4 categories in a dataframe and then plot that.

dcjones commented 10 years ago

You can make that sort of plot without a data frame, but you'd still have to stack the vectors and make a new species vector of the same length as the stacked vectors to bind to color.
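The stacking described above might look roughly like the following. This is only a sketch in current Julia syntax (not the 2013-era syntax used elsewhere in this thread); the species names and counts are made-up illustration data, and the final plot call is left as a comment because it assumes Gadfly is installed.

```julia
# Four hypothetical per-species vectors of equal length (made-up data).
sparrow = [3, 5, 4, 6]
finch   = [2, 2, 3, 5]
robin   = [7, 6, 6, 4]
wren    = [1, 2, 2, 3]

n = length(sparrow)

# Stack the four vectors end to end.
counts = vcat(sparrow, finch, robin, wren)

# Repeating index 1..n, once per species, to bind to x.
index = repeat(1:n, outer=4)

# One species label per stacked value, to bind to color.
species = repeat(["sparrow", "finch", "robin", "wren"], inner=n)

# With Gadfly loaded, the three vectors could then be bound directly
# (assumed call, not run here):
# plot(x=index, y=counts, color=species, Geom.line)
```

All three vectors have the same length, which is the requirement mentioned above for binding the species vector to color.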

Originally everything was through data frames though, so there may still be places where it breaks down without one. Titles for color keys was one that you pointed out.

john9631 commented 10 years ago

That's cool. I will take them in the direction of dfs but simplify it so that it doesn't sound like something "difficult." Something I don't fully understand is the relationship between Winston and Gadfly. Crudely oversimplifying, I had thought that Winston was like "Pylab for Julia" and Gadfly was "ggplot2 for Julia", but I read a post about changes to Winston's syntax and it's not clear to me. Can you clarify that for me? Feel free to email me at john . lynch at iname . com

I tested number 5 again. x_discrete made no difference. Nor did y_discrete. It would be nice to have an option where the tails of the boxplot were no longer than the actual range.

dcjones commented 10 years ago

Oh, I completely misread your original comment!

When you mentioned ranges, I thought you were referring to the first histogram plot you showed. It turns out the fences on boxplots were being incorrectly computed. That's fixed now. I think you should also be able to draw boxplots omitting the x aesthetic.
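For reference, one common convention for boxplot fences (Tukey's 1.5 × IQR rule) can be sketched as below. This is an illustration of that general convention only; it is an assumption that it matches what Gadfly computes, and `whisker_ends` is a hypothetical helper name, not part of Gadfly.

```julia
using Statistics

# Whisker ends under the Tukey convention (assumed here, not taken from
# Gadfly's source): fences sit 1.5 * IQR beyond the quartiles, and each
# whisker stops at the most extreme observation inside its fence.
function whisker_ends(y)
    q1, q3 = quantile(y, [0.25, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    lo = minimum(v for v in y if v >= lower_fence)
    hi = maximum(v for v in y if v <= upper_fence)
    return lo, hi
end

# Made-up data with one outlier at 20.0.
y = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 20.0]
lo, hi = whisker_ends(y)
```

Because the whiskers end at actual observations, they can never extend beyond the range of the data; values outside the fences (20.0 here) would be drawn as outliers instead.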

john9631 commented 10 years ago

Thanks Daniel.

dcjones commented 10 years ago

Titles of color keys can now be explicitly set like:

plot(..., Guide.colorkey("Color Key Title"))

Subplot grid titles can be set with Guide.xlabel and Guide.ylabel.

I fixed a bug with Geom.smooth and Geom.subplot_grid, so those two should play nice now.

Error bars should work correctly with bar plots, if used explicitly like so:

using RDatasets, DataFrames, Gadfly

df = subset(data("plm", "Cigar"), :(state .== 1))

# silly fake error bars intervals
ymin = df["sales"] .- 20*rand()
ymax = df["sales"] .+ 20*rand()

plot(df, x="year", y="sales", ymin=ymin, ymax=ymax,
     Geom.bar, Geom.errorbar)

[Image: Errorbar Barplot]

I'm open to finding ways to make that easier, but I don't want it to be automatic. Estimating confidence or credible intervals typically involves some pretty big assumptions about the data. I don't want to make those assumptions for people.
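As a rough illustration of the assumptions involved, a normal-approximation interval for a mean could be computed by hand and then passed as ymin and ymax. The sketch below is in current Julia syntax; `normal_ci` is a hypothetical helper, the data are made up, and the 1.96 multiplier presumes approximately normal sampling error, which is exactly the kind of assumption that has to be made explicitly.

```julia
using Statistics

# Mean with a normal-approximation interval: mean ± z * standard error.
# Assumes independent observations and roughly normal sampling error.
function normal_ci(values; z=1.96)
    m = mean(values)
    se = std(values) / sqrt(length(values))
    return (m - z * se, m, m + z * se)
end

# Made-up measurements for a single bar.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
ymin, y, ymax = normal_ci(values)
```

The resulting ymin and ymax are plain numbers, so the choice of interval stays with the user rather than the plotting library.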

Thanks for the thorough testing. I'll be happy to fix anything else you find.

john9631 commented 10 years ago

All going well up to bar including error bar. Please excuse the test of "confidence interval"; I know the distribution issues, but I need to show biologists what they could do if they wanted :)

The bar plot isn't drawing correctly for me. Here is a simple example plotting a line with it drawn on top by hand

zz = DataFrame(ix=1:10, y=1:10)
plot(zz, x="ix", y="y", Geom.bar)

has the same effect. Have I got an earlier Gadfly (0.1.20)?

Pkg.status()

Warning: using Base.Stat in module Stat conflicts with an existing identifier.
Required packages:

dcjones commented 10 years ago

I forgot to tag a new version. After updating, you should be at 0.1.21 now.

john9631 commented 10 years ago

Got it thanks.
Sorry, I forgot that I hadn't tested smooth.
It's OK; I had a glitch but I can no longer reproduce it, so it was probably a finger error.

I've been working on that draft and found a couple more issues.

  1. The histogram selected its bin sizes poorly, which made me recall that you are typically encouraged to look at a number of either binwidths or bins settings to get a clearer perception of your data. Also, Hadley's paper calls for that. Can we have such a setting for histograms, please? My personal preference is to specify bins.
  2. When it plots the error bars on points it used zero as the lower bound, leaving a bit of spare white space. I can move the chart up and down with my cursor ... can I change its zoom?

[Image: With and without error bars]

Thanks.

john9631 commented 10 years ago

The bar and error bar combine perfectly now. One thing that appears out of place is the AAABBB heading the color key. I assume Guide.colorkey just needs a "" default.

john9631 commented 10 years ago

Hopefully the last one. With the new bar plot, when coloration by categorical (either number or string) is introduced then some bars are dramatically extended. Exactly the same with vector or dataframe. This is a link to the data used.

Subplotting the barcharts works fine but if color = xgroup the same problem as shown in the picture occurs.

john9631 commented 10 years ago

Is there a pie chart option?

dcjones commented 10 years ago

Yeah, that looks pretty broken. I'll see what's going on.

There aren't pie charts yet. I'll add them eventually, but stacked bar charts normalized to 100% are often more readable, and easier to add at this point, so I'm going to do that first.

john9631 commented 10 years ago

I'll look forward to testing them and adding them to the reference and the tutorial.

john9631 commented 10 years ago

Low priority ones.

LP1. In preparation for stacked bar charts I was looking at the others. Is this the behaviour you expect here? Adding color made no difference.

LP2. With boxplots I think a maximum width should be set (maybe 1.5x the span of the cross bars at high and low)

LP3. For standard line or point plots (others??) x could usefully default to 1:length(y) so that users don't have to figure it out.

plot(x=1:50, y=d_age)   ===>  plot( y=d_age)

LP4. I'm still getting min/max warnings (if they're in your code).

plot(x=1:size(d_age,1), y=d_age, Guide.xlabel("Respondent"), Guide.ylabel("Age"),
     Geom.errorbar, ymin=d_age-1.96*std(d_age), ymax=d_age+1.96*std(d_age),
     color=collect(d_sex), Guide.colorkey("Sex"), Geom.smooth, Geom.point)

generates: WARNING: min(x) is deprecated, use minimum(x) instead.

dcjones commented 10 years ago

> The histogram selected its bin sizes poorly, which made me recall that you are typically encouraged to look at a number of either binwidths or bins settings to get a clearer perception of your data. Also, Hadley's paper calls for that. Can we have such a setting for histograms, please? My personal preference is to specify bins.

You can now manually set the number of bins, or put an upper or lower limit on the number of bins automatically selected. See the list of arguments here.

> When it plots the error bars on points it used zero as the lower bound, leaving a bit of spare white space. I can move the chart up and down with my cursor ... can I change its zoom?

To set the viewport manually you can now do something like this.

plot(x=rand(10), y=rand(10),
     Scale.x_continuous(minvalue=-1, maxvalue=1),
     Scale.y_continuous(minvalue=-5, maxvalue=5))

That's in the manual now as well under the scales section.

dcjones commented 10 years ago

> Hopefully the last one. With the new bar plot, when coloration by categorical (either number or string) is introduced then some bars are dramatically extended. Exactly the same with vector or dataframe. This is a link to the data used.

The problem here is that Geom.bar assumes the data is already summarized. Since there are multiple rows in your data with the same age, these bars get stacked on top of each other, hence the extended bars. That's pretty weird and counter-intuitive, but I need to figure out what the right thing to do is.

In the mean time, you can get better results by using Geom.histogram and adding Scale.discrete_color.

john9631 commented 10 years ago

OK, that makes sense. Bar assumes that there is one y value for each x. This also causes an issue when you apply error bars, as you overlay one bar per point. (The error bars are very nice, by the way, and adjusting the theme for zero-width lines lets you match the style used in some articles.)

Maybe the solution is that the data has to be corrected first ... and the issue should be left exposed to remind the user that their data is richer than the method they're choosing. Sometimes an average would be right, other times a min or a max.
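Correcting the data first could be as simple as collapsing repeated x values to a single y before plotting. The sketch below is in current Julia syntax; `summarize` is a hypothetical helper (not a Gadfly or DataFrames function) and the data are made up. The reducer is passed in, so an average, a min or a max can be chosen per case.

```julia
using Statistics

# Collapse repeated x values to one y per x using the reducer `f`
# (e.g. mean, minimum, maximum). Returns the sorted unique x values
# and the reduced y value for each.
function summarize(x, y, f)
    groups = Dict{eltype(x), Vector{eltype(y)}}()
    for (xi, yi) in zip(x, y)
        push!(get!(groups, xi, Vector{eltype(y)}()), yi)
    end
    xs = sort(collect(keys(groups)))
    return xs, [f(groups[k]) for k in xs]
end

# Made-up data with several rows per age.
age   = [21, 21, 22, 23, 23, 23]
score = [4.0, 6.0, 5.0, 1.0, 2.0, 9.0]

ages, means = summarize(age, score, mean)
_, maxes    = summarize(age, score, maximum)
```

The summarized vectors have exactly one y per x, which is the shape Geom.bar is described above as expecting.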

kmsquire commented 10 years ago

> There aren't pie charts yet. I'll add them eventually, but stacked bar charts normalized to 100% are often more readable, and easier to add at this point, so I'm going to do that first.

Suggestion: don't implement pie charts. See http://www.perceptualedge.com/articles/08-21-07.pdf

timholy commented 10 years ago

I'm quite sympathetic to the idea of banning pie charts; I agree that bars are better for most purposes.

However, a mildly-interesting counterpoint: recently I had a referee specifically request a pie chart. Sometimes arguing is not worth the trouble it could cause, and it's better to just give them what they're asking for.

john9631 commented 10 years ago

If you want broad adoption you don't want to be the one persuading the customer that Beta is better than VHS. A specialist might buy the argument but your biologist will just go "what, no bar charts?"

kmsquire commented 10 years ago

> If you want broad adoption you don't want to be the one persuading the customer that Beta is better than VHS. A specialist might buy the argument but your biologist will just go "what, no bar charts?"

I assume you meant "pie charts".

I work in a lab with biologists, and I'm forever attempting to get them to remove pie charts from their presentations... I've made progress, but there are some holdouts... ;-)

john9631 commented 10 years ago

Yes, you assume right. It's the old "the customer may not be right, but he is the king" problem.
