GiovineItalia / Gadfly.jl

Crafty statistical graphics for Julia.
http://gadflyjl.org/stable/
Other
1.9k stars 250 forks source link

Plotting DataArrays (and especially NA) #567

Open vchuravy opened 9 years ago

vchuravy commented 9 years ago

Since Gadfly is using DataFrames and DataArrays I was surprised that it seems not to handle NA

data = @data [1.0, NA, 2.0]
plot(x = 1:length(data), y = data)

In code where I currently rely on plotting NA values with Gadfly I define

Base.convert(::Type{Float64}, ::NAtype) = NaN

but that is a stopgap at best and it leads to #566

MaciekLeks commented 8 years ago

I do some comparison between Julia and R on basis of an online course (every exercise in R I try to do in Julia in addition). I carry out this for myself and I'm going to do a presentation in some organization (in order to encourage people to use Julia). I got stuck very quickly due to the NA problem. If I show people complete_cases function they'll start too lough very loudly. The same effect I suppose I would get for Base.convert... It's a pity.

timholy commented 8 years ago

@MaciekLeks, this is open source, so I bring you the good news that you are in charge of your own destiny :smile:. Have you tried defining the method that the error complains about?

julia> import Base: *

julia> (*)(::DataArrays.NAtype, ::Measures.Length{:mm,Float64}) = Measures.Length(:mm,NaN)
* (generic function with 309 methods)

Now try your plot.

You could even submit a pull request to add that to Compose!

MaciekLeks commented 8 years ago

@timholy, in my particular case I'm struggling with subtypes of AbstractString.

julia> plot(df, x="col", Geom.histogram) _Error showing value of type Gadfly.Plot: ERROR: MethodError: convert has no method matching convert(::Type{UTF8String}, ::DataArrays.NAtype) This may have arisen from a call to the constructor UTF8String(...), since type constructors fall back to convert methods. Closest candidates are: call{T}(::Type{T}, ::Any) convert(::Type{UTF8String}, ::UTF8String) convert(::Type{UTF8String}, ::ASCIIString) ... in collect at array.jl:247 in discretize at /home/user/.julia/v0.5/Gadfly/src/scale.jl:348 in discretize at /home/user/.julia/v0.5/Gadfly/src/scale.jl:342 in apply_scale at /home/user/.julia/v0.5/Gadfly/src/scale.jl:427 in apply_scales at /home/user/.julia/v0.5/Gadfly/src/scale.jl:39 in apply_scales at /home/user/.julia/v0.5/Gadfly/src/scale.jl:59 [inlined code] from dict.jl:473 in render_prepare at /home/user/.julia/v0.5/Gadfly/src/Gadfly.jl:652 in render at /home/user/.julia/v0.5/Gadfly/src/Gadfly.jl:718 [inlined code] from /home/user/.julia/v0.5/Gadfly/src/Gadfly.jl:822 in display at /home/user/.julia/v0.5/Gadfly/src/Gadfly.jl:974 [inlined code] from expr.jl:8 in display at /home/user/.julia/v0.5/Gadfly/src/Gadfly.jl:926 in print_response at REPL.jl:134 in print_response at REPL.jl:121 in anonymous at REPL.jl:624 in run_interface at ./LineEdit.jl:1610 in run_frontend at ./REPL.jl:864 in run_repl at ./REPL.jl:167 in start at ./client.jl:419 and calling your code for {AbstractString, ASCIIString, UTF8String} (*)(::DataArrays.NAtype, ::Measures.Length{:mm,<type>}) = Measures.Length(:mm,NaN) gives the same error.

Regarding your suggestions:

  1. Thank you for the immediate response. I appreciate it much.
  2. Why your proposal is better than the first one (using Base.convert)?
  3. I would pull the request if I get enough skills in that (I'm a child of SVN and at this moment I do not feel comfortable working on git)
timholy commented 8 years ago

@MaciekLeks, without the backtrace it's hard to know how to fix it: in particular, the first line of the backtrace shows the function that generated the error.

My proposal is more limited: @vchuravy is rightly concerned about converting all NA values into NaN, but the operation of multiplying by measures is more specific.

If you're uncomfortable submitting it, I can do it. If you figure out how to fix your specific error (or debug it a bit more), I can include your fix, too.

MaciekLeks commented 8 years ago

@timholy,

I've updated the trace, but also I'm giving more detail: a. The file to work on: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/reddit.csv b. The sample code which brings the mess:

using DataArrays
using DataFrames
using Gadfly 

reddit = readtable("<path>/reddit.csv", makefactors=true) #the creation of factors is important for me
plot(reddit, x="cheese", Geom.histogram)
timholy commented 8 years ago

Thanks, @MaciekLeks. I need to switch to my own work now, but either I'll get to this later or, if I may suggest a good debugging strategy: notice that the error is in the discretize function. You can debug this way:

julia> function Gadfly.discretize(values, levels=nothing, order=nothing, preserve_order=true)
    @show typeof(values) values
    @show typeof(levels) levels
    @show typeof(order) order
    # rest of discretize's code goes here, putting `Gadfly.` in front of non-exported methods (e.g., `Gadfly.discretize_make_pda`)
end

EDIT (crucial!): now run your plotting code. Look at the output that's generated. It will tell you what the inputs to discretize are, and this might help you fix the problem. In your case, levels must include an NA (perhaps among other values), and knowing what the container type is and what the other values are should help figure out how to fix it.

Perhaps it will be as simple as

Base.convert(::Type{UTF8String}, ::DataArrays.NAtype) = convert(UTF8String, "NA")
timholy commented 8 years ago

Oh, and also, a tip: julia 0.5 is really for "users willing to fix bugs themselves"---most package authors probably aren't supporting it yet (e.g., Gadfly's tests fail on 0.5, but pass on 0.4). If you need a stable release, I recommend julia 0.4.

timholy commented 8 years ago

Note I edited 2 posts up; I commented before it was complete.

MaciekLeks commented 8 years ago

Thank your for you tips. I'll follow them. I'll try to figure out what is going on here.

P.S. If I need a stable version I would have used .NET 4.0 instead ;)

timholy commented 8 years ago

:smile:

timholy commented 8 years ago

Oh, and the reason you need to start a fresh julia session is that you have to redefine discretize before any code that uses it has been called (compiled). Otherwise the old version gets compiled in, and your new version never executes.

MaciekLeks commented 8 years ago

Of course Base.convert(::Type{UTF8String}, ::DataArrays.NAtype) = convert(UTF8String, "NA") works, but it's not what we wanted.

I've walked through the instruction:

typeof(values) = DataArrays.PooledDataArray{UTF8String,UInt32,1}
values[end] = "Cheddar"
typeof(levels) = Void
levels = nothing
typeof(order) = Void
order = nothing
timholy commented 8 years ago

It that what it showed right above the error? The thing that doesn't make sense is that there's no hint of an NAtype anywhere.

MaciekLeks commented 8 years ago

Yes. That is the stuff before "Error: MethodError: convert..." comes.

MaciekLeks commented 8 years ago

collect in the line 353(da = discretize_make_pda(values, collect(eltype(values), levels))) cannot deal with conversion from NA to a string type, so it all comes down to convert.

Regarding levels = nothing at the beginning of the function. Why does it equal nothing? Should not it be taken out from PooledDataArray at DF?

MaciekLeks commented 8 years ago

@timholy, a couple of observations and questions:

  1. scale.jl (especially discretize) does not use existing PooledDataArray(PDA) in DataFrame(DM). Every time it is needed a new PDA is created. Is it by design or it's a bug?
  2. the preserve_order arg in discretize has got unknown goal. What is it for?
  3. There is a solution to cope with my NA problem without code changes (Scale.x_discrete(levels=...)): plot(reddit, x="age_range", Geom.histogram, Scale.x_discrete(levels=levels(reddit[:age_range])))
  4. nastring arg does not work as well: reddit = readtable(fname, makefactors=true,nastrings=["NA", "na", "n/a", "missing"])

Suggestion: To cope with NA problem for PooledDataArrays it is enough to add simple condition to check type of values in discretize and then choose one of the following path: a. da=values (make use of existing PDA) b. levels = DataArrays.levels(values); da = discretize_make_pda(values, collect(eltype(values), levels)) (every Gadfly's PDA is its own, there is no sharing PDA with DF) Which one is in accordance with the programming principles for scale.jl?

I could prepare a Pull Request but I feel there is something more. There is a hidden floor in this issue.