GiovineItalia / Gadfly.jl

Crafty statistical graphics for Julia.
http://gadflyjl.org/stable/

Automatic x range is expanded too wide #252

Open davidanthoff opened 10 years ago

davidanthoff commented 10 years ago

If I plot the following

plot(x=collect(1899:2401),y=collect(1899:2401),Geom.line)

I get a graph that has a range from 1800 to 2600 on the x axis and that essentially looks pretty ugly because that range is too large. A much prettier plot would have a range that only spans the actual data (so 1899:2401) and then it might have ticks at say 1900 to 2400 with step size 100 or so.

So, one suggestion and question: 1) it would be great if this could look better automatically and 2) is there a way to force the x min and max values on the plot? I tried Scale.x_continuous with minvalue and maxvalue, but if I set those to the actual range of the data it again displays this on an axis going from 1800 to 2600.

dcjones commented 10 years ago

You can get the effect you want by using 1900 and 2400 in x_continuous, like:

plot(x=collect(1899:2401), y=collect(1899:2401), Geom.line,
     Scale.x_continuous(minvalue=1900, maxvalue=2400),
     Scale.y_continuous(minvalue=1900, maxvalue=2400))


I agree that the automatic range and tick marks aren't always great. Here it doesn't work well because it has to choose values that span the data, and (1900, 2400) doesn't span (1899, 2401). So slightly relaxing that constraint when choosing the range might be a good idea. I'll experiment with this.
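To see why the spanning constraint widens the axis, here is a minimal sketch. It is not Gadfly's optimize_ticks: the function names `spanning_ticks` and `coverage` are hypothetical, and the step size is fixed by hand rather than searched over. It only shows that any tick range forced to cover (1899, 2401) has to start at 1800 or below.

```julia
# Smallest tick range on a fixed step that still covers [lo, hi].
# Hypothetical helper for illustration; Gadfly chooses the step itself.
function spanning_ticks(lo::Real, hi::Real, step::Real)
    first_tick = floor(lo / step) * step   # round the lower end down
    last_tick = ceil(hi / step) * step     # round the upper end up
    return first_tick:step:last_tick
end

# Fraction of the tick range that is actually occupied by data.
coverage(lo, hi, ticks) = (hi - lo) / (last(ticks) - first(ticks))

ticks100 = spanning_ticks(1899, 2401, 100)   # runs from 1800 to 2500
ticks200 = spanning_ticks(1899, 2401, 200)   # runs from 1800 to 2600
cov200 = coverage(1899, 2401, ticks200)      # 502 / 800
```

With a step of 200 the data fills only about 63% of the axis, which matches the 1800 to 2600 range reported above.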

davidanthoff commented 10 years ago

I had a look at optimize_ticks. That does not look like the original Wilkinson scoring, right? I now replaced that with a translation of an R package that implements the original Wilkinson scoring method and at least this example graph looks better. I'll clean up a bit and then you might have a look.

dcjones commented 10 years ago

It's Wilkinson's method as described in his book, with a couple tweaks. The R package may have a better implementation. I'd be happy to use that if so.

davidanthoff commented 10 years ago

I've spent some more time on this, and I think getting it right requires more changes. Here are my current thoughts:

1) Right now the coord adds padding around the range of the scale. That makes a lot of graphs look weird and makes it essentially impossible to, e.g., have the first tick align with the crossing axis. There is a comment in the code that says this is a kludge but is needed to fit bar graphs on a discrete scale. I looked at the book again, and it seems to me the right approach is to solve the discrete-case problem in the discrete scale construction (this is described on p. 94 of the book) and then introduce no padding in the coord.

2) I think that for a continuous scale, a min and max passed in by the user should be a hard constraint, i.e. the algorithm shouldn't try to find ticks outside that range and then have the scale cover the ticks that bracket the min/max combo.
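A rough sketch of what point 2 could mean in practice, assuming a fixed step: ticks are rounded inward so that none falls outside the requested limits, and the visible range stays exactly what the user passed in. The name `constrained_ticks` is hypothetical and not part of Gadfly.

```julia
# Ticks on a fixed step that never fall outside [minvalue, maxvalue].
# Hypothetical helper; the step is supplied by hand for illustration.
function constrained_ticks(minvalue::Real, maxvalue::Real, step::Real)
    first_tick = ceil(minvalue / step) * step    # round the lower end up
    last_tick = floor(maxvalue / step) * step    # round the upper end down
    return first_tick:step:last_tick
end

ticks = constrained_ticks(1899, 2401, 100)   # runs from 1900 to 2400
viewport = (1899, 2401)                      # the scale keeps the requested limits
```

For the example from the first post this gives ticks at 1900, 2000, ..., 2400 on an axis that spans only the data.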

My sense is that with those two changes things would already look a bit better and if one wanted one could control things better. I'm happy to have a go at this, but right now there is one thing I don't understand: why is the optimize_ticks function called in the statistics? I guess I just don't understand the structure of the code well enough at this point, but I somehow had assumed that all of that stuff should be in the scale code.

dcjones commented 10 years ago

Thanks for taking the time to investigate this.

> right now the coord adds padding around the range of the scale that makes a lot of graphs look weird

There are two separate kludges here. For discrete scales, padding is added to prevent bar plots and boxplots from drawing outside the plot canvas. That should be done in a better way, like you say.

For continuous scales, padding is added so that the labels for the first tick on the x- and y-axis aren't crowded together (that's the 0.03 * (xmax - xmin) term). I don't think that padding is especially weird, but it could be handled better.
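As a small worked example of that padding term: the 0.03 factor comes from the comment above, while the function name `padded_view` and the choice to pad both ends equally are assumptions made only for illustration.

```julia
# Widen a view range by a fraction of its width.
# Assumption: the padding is applied to both ends; the thread only gives the 0.03 factor.
function padded_view(xmin::Real, xmax::Real; frac::Real = 0.03)
    pad = frac * (xmax - xmin)
    return (xmin - pad, xmax + pad)
end

view_lo, view_hi = padded_view(1800, 2600)   # pad is 0.03 * 800 = 24 per side
```

Under that assumption, a tick range of 1800 to 2600 would be drawn on a view of roughly 1776 to 2624, so the first tick no longer sits on the crossing axis.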

> I think for a continuous scale if one passes in a min and max that should be a hard constraint

I agree. That would be easier to interpret than how it works now.

> why is the optimize_ticks function called in the statistics?

Wilkinson includes choosing ticks as part of the scale. I felt like that muddied the concept a little, so I tried to structure it differently. Tick generation is a statistic, since like other statistics it's a function that computes aesthetics from some other aesthetics, in this case computing xtick, ytick, etc. from x, y, etc. That's not necessarily the right thing to do, but it has some advantages.
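The idea of ticks as a statistic can be sketched in a few lines. This is a toy model, not Gadfly's internals: the aesthetics are a plain NamedTuple, `tick_statistic` is a hypothetical name, and the step is fixed instead of optimized.

```julia
# A "statistic" in this sense reads one aesthetic (x) and produces another (xtick).
# Toy stand-in for illustration; Gadfly's own aesthetics type is not used here.
function tick_statistic(aes::NamedTuple; step::Real = 100)
    lo = floor(minimum(aes.x) / step) * step
    hi = ceil(maximum(aes.x) / step) * step
    return merge(aes, (xtick = collect(lo:step:hi),))
end

aes = (x = collect(1899:2401),)
aes = tick_statistic(aes)   # aes now carries both x and xtick
```

The scale never has to know about ticks in this arrangement; the tick step simply adds one more field to the aesthetics that the guides can draw later.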

felixjung commented 10 years ago

Any idea why this could be happening with my y-axis tick labels?

[screenshot: 2014-09-19 at 18:50:25]

The plot is constructed using this code

    pl = plot(plot_data,
        layer(x = "s", y= "int", Geom.point, Theme(default_color = color_dot)),
        layer(x = "s", y = "int_curve", Geom.smooth(), Theme(default_color = color_line)),
        Scale.y_continuous(minvalue = y_min, maxvalue = y_max),
        Scale.x_continuous(minvalue = 0, maxvalue = 36),
        Guide.xticks(ticks = [0:6] * 6),
        plot_theme,
        Guide.Title("Firm $firm_id; $date_string"),
        Guide.ylabel(y_lab, orientation = :vertical),
        Guide.xlabel("Prediction horizon (months)")
    );

Thanks,

Felix

dcjones commented 10 years ago

I'm not sure. I haven't managed to reproduce this yet. What are y_min and y_max set to here?

felixjung commented 10 years ago

I create a series of plots which are then displayed in an animation using the animate package in LaTeX. To force all plots to have the same y-axis, I determine the minimum and maximum y-values across all plots and set y_min and y_max accordingly for each plot (the plots are created in a loop).

UPDATE: I've checked the values for another example. The actual values are y_min = 1.539317798151983e-6 and y_max = 0.03169651300783998. Maybe I should round these to the second decimal? It looks to me like your automatic tick computation might have trouble with these values.

[screenshot: 2014-09-22 at 10:52:04]

UPDATE: Flooring/ceiling to the second decimal fixed the issue in the above example. Rounding to the fourth decimal did not and resulted in the same problem with tick labels.

[screenshot: 2014-09-22 at 11:00:57]
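For reference, the flooring/ceiling workaround could look roughly like this in a current Julia version, where `floor` and `ceil` take a `digits` keyword. The `series` data is made up; only the two extreme values are taken from the report above. Note that this helped in one example only and is not a general fix.

```julia
# Hypothetical per-plot y values; only the smallest and largest come from the thread.
series = [[1.539317798151983e-6, 0.012], [0.004, 0.03169651300783998]]

# Shared limits across all plots, so every frame gets the same y-axis.
y_min = minimum(minimum, series)
y_max = maximum(maximum, series)

# Round outward to the second decimal so that no data point is clipped.
y_lo = floor(y_min; digits = 2)
y_hi = ceil(y_max; digits = 2)
```

The rounded values would then be passed as `minvalue` and `maxvalue` to `Scale.y_continuous`, as in the plotting code earlier in the thread.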

felixjung commented 10 years ago

Unfortunately, my attempts to fix this for all my plots have failed so far. I haven't looked at the Gadfly code yet, but my understanding of the problem is the following:

  1. I pass the minvalue and maxvalue parameters to Scale.y_continuous()
  2. The plot pane (in my examples the black border rectangle) will adhere to my parameters
  3. Your code tries to find good intervals for the grid. When, for example, minvalue = 0.01 and maxvalue = 0.41 your algorithm will determine the appropriate ticks as [0:0.1:0.5]. The highest tick value, however, is larger than my maxvalue. Gadfly will ignore this and simply print the labels for all tick values. It should however remove any tick values larger than my maxvalue parameter.
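The behaviour asked for in step 3 amounts to a filter over the computed ticks. The sketch below is a stand-alone illustration with a hypothetical function name; it does not show where such a filter would live in Gadfly.

```julia
# Keep only the ticks that lie inside the user-supplied limits.
# Hypothetical helper for illustration, not a Gadfly function.
visible_ticks(ticks, minvalue, maxvalue) =
    filter(t -> minvalue <= t <= maxvalue, ticks)

all_ticks = collect(0:0.1:0.5)
visible = visible_ticks(all_ticks, 0.01, 0.41)   # drops 0.0 and 0.5
```

With minvalue = 0.01 and maxvalue = 0.41, only the four ticks from 0.1 to 0.4 would keep their labels.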

I'll have a look at the code now.

[screenshot: 2014-09-22 at 16:31:18]

felixjung commented 10 years ago

Man I wish I didn't find it so hard to understand the Gadfly codebase. It's so opaque to me :( I'm never quite sure which code is responsible for what and there are so many calls to functions that provide other functions in some anonymous way :(

Here are some questions:

  1. I assume
        if scale.minvalue != nothing
            if scale.vars === x_vars
                aes.xviewmin = scale.trans.f(scale.minvalue)
            elseif scale.vars === y_vars
                aes.yviewmin = scale.trans.f(scale.minvalue)
            end
        end

        if scale.maxvalue != nothing
            if scale.vars === x_vars
                aes.xviewmax = scale.trans.f(scale.maxvalue)
            elseif scale.vars === y_vars
                aes.yviewmax = scale.trans.f(scale.maxvalue)
            end
        end

in scale.jl assigns the minimum and maximum values used when determining how much of the plot is visible (i.e. the panel)? However, somehow some margins still need to be applied to these values.

  2. Are the aes.yviewmin and aes.yviewmax properties used to determine the aes.ytickvisible property in guide.jl? Where does aes.ytickvisible get assigned?
  3. Is the aes.ytickvisible property (containing all visible ticks?) responsible for which tick labels are drawn when using any of the static backends? I realise that you actually compute more ticks than those that are visible in the static backends. This allows you to zoom in/out in the JS backend.

Maybe you can point me in the right direction. I wish I had a better understanding of the code base. Sorry, I can't be of more help.

dcjones commented 10 years ago

Thanks for trying to debug this. Obviously I haven't documented the codebase at a high level, so it can be pretty intimidating to wade into.

I still haven't figured this out, and can't provoke it into reproducing this. (Is there any way you could post an example with data that causes this?) I can try to point you in the right direction:

dcjones commented 10 years ago

While trying to explain the code to you, I may have found the problem. Could you check out master and see if there's any difference?

felixjung commented 10 years ago

Sorry, I was on vacation and am now busy with everything but research. I'll take a look soon. Thanks a lot!

felixjung commented 10 years ago

Hi, unfortunately this did not fix the problem. I'll try to give you a minimal example soon. Thanks for the effort!

smason commented 9 years ago

I've just found this issue; not sure why I didn't see it before posting to Stack Overflow. I've written some code to generate what seem to be better axis ticks:

http://stackoverflow.com/questions/28943866/r-style-axis-ticks-with-gadfly-jl

Not sure if this is helpful, but providing some way of changing the tick method may be nice.