Open vchuravy opened 9 years ago
I do some comparison between Julia and R on basis of an online course (every exercise in R I try to do in Julia in addition). I carry out this for myself and I'm going to do a presentation in some organization (in order to encourage people to use Julia). I got stuck very quickly due to the NA
problem. If I show people complete_cases
function they'll start too lough very loudly. The same effect I suppose I would get for Base.convert
... It's a pity.
@MaciekLeks, this is open source, so I bring you the good news that you are in charge of your own destiny :smile:. Have you tried defining the method that the error complains about?
julia> import Base: *
julia> (*)(::DataArrays.NAtype, ::Measures.Length{:mm,Float64}) = Measures.Length(:mm,NaN)
* (generic function with 309 methods)
Now try your plot.
You could even submit a pull request to add that to Compose!
@timholy, in my particular case I'm struggling with subtypes of AbstractString
.
julia> plot(df, x="col", Geom.histogram)
_Error showing value of type Gadfly.Plot:
ERROR: MethodError: convert
has no method matching convert(::Type{UTF8String}, ::DataArrays.NAtype)
This may have arisen from a call to the constructor UTF8String(...),
since type constructors fall back to convert methods.
Closest candidates are:
call{T}(::Type{T}, ::Any)
convert(::Type{UTF8String}, ::UTF8String)
convert(::Type{UTF8String}, ::ASCIIString)
...
in collect at array.jl:247
in discretize at /home/user/.julia/v0.5/Gadfly/src/scale.jl:348
in discretize at /home/user/.julia/v0.5/Gadfly/src/scale.jl:342
in apply_scale at /home/user/.julia/v0.5/Gadfly/src/scale.jl:427
in apply_scales at /home/user/.julia/v0.5/Gadfly/src/scale.jl:39
in apply_scales at /home/user/.julia/v0.5/Gadfly/src/scale.jl:59
[inlined code] from dict.jl:473
in render_prepare at /home/user/.julia/v0.5/Gadfly/src/Gadfly.jl:652
in render at /home/user/.julia/v0.5/Gadfly/src/Gadfly.jl:718
[inlined code] from /home/user/.julia/v0.5/Gadfly/src/Gadfly.jl:822
in display at /home/user/.julia/v0.5/Gadfly/src/Gadfly.jl:974
[inlined code] from expr.jl:8
in display at /home/user/.julia/v0.5/Gadfly/src/Gadfly.jl:926
in print_response at REPL.jl:134
in print_response at REPL.jl:121
in anonymous at REPL.jl:624
in run_interface at ./LineEdit.jl:1610
in run_frontend at ./REPL.jl:864
in run_repl at ./REPL.jl:167
in start at ./client.jl:419
and
calling your code for {AbstractString
, ASCIIString
, UTF8String
}
(*)(::DataArrays.NAtype, ::Measures.Length{:mm,<type>}) = Measures.Length(:mm,NaN)
gives the same error.
Regarding your suggestions:
Base.convert
)?@MaciekLeks, without the backtrace it's hard to know how to fix it: in particular, the first line of the backtrace shows the function that generated the error.
My proposal is more limited: @vchuravy is rightly concerned about converting all NA values into NaN, but the operation of multiplying by measures is more specific.
If you're uncomfortable submitting it, I can do it. If you figure out how to fix your specific error (or debug it a bit more), I can include your fix, too.
@timholy,
I've updated the trace, but also I'm giving more detail: a. The file to work on: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/reddit.csv b. The sample code which brings the mess:
using DataArrays
using DataFrames
using Gadfly
reddit = readtable("<path>/reddit.csv", makefactors=true) #the creation of factors is important for me
plot(reddit, x="cheese", Geom.histogram)
Thanks, @MaciekLeks. I need to switch to my own work now, but either I'll get to this later or, if I may suggest a good debugging strategy: notice that the error is in the discretize
function. You can debug this way:
using Gadfly
discretize
function to show debugging info:julia> function Gadfly.discretize(values, levels=nothing, order=nothing, preserve_order=true)
@show typeof(values) values
@show typeof(levels) levels
@show typeof(order) order
# rest of discretize's code goes here, putting `Gadfly.` in front of non-exported methods (e.g., `Gadfly.discretize_make_pda`)
end
EDIT (crucial!): now run your plotting code. Look at the output that's generated. It will tell you what the inputs to discretize
are, and this might help you fix the problem. In your case, levels
must include an NA
(perhaps among other values), and knowing what the container type is and what the other values are should help figure out how to fix it.
Perhaps it will be as simple as
Base.convert(::Type{UTF8String}, ::DataArrays.NAtype) = convert(UTF8String, "NA")
Oh, and also, a tip: julia 0.5 is really for "users willing to fix bugs themselves"---most package authors probably aren't supporting it yet (e.g., Gadfly's tests fail on 0.5, but pass on 0.4). If you need a stable release, I recommend julia 0.4.
Note I edited 2 posts up; I commented before it was complete.
Thank your for you tips. I'll follow them. I'll try to figure out what is going on here.
P.S. If I need a stable version I would have used .NET 4.0 instead ;)
:smile:
Oh, and the reason you need to start a fresh julia session is that you have to redefine discretize
before any code that uses it has been called (compiled). Otherwise the old version gets compiled in, and your new version never executes.
Of course Base.convert(::Type{UTF8String}, ::DataArrays.NAtype) = convert(UTF8String, "NA")
works, but it's not what we wanted.
I've walked through the instruction:
typeof(values) = DataArrays.PooledDataArray{UTF8String,UInt32,1}
values[end] = "Cheddar"
typeof(levels) = Void
levels = nothing
typeof(order) = Void
order = nothing
It that what it showed right above the error? The thing that doesn't make sense is that there's no hint of an NAtype
anywhere.
Yes. That is the stuff before "Error: MethodError: convert
..." comes.
collect
in the line 353(da = discretize_make_pda(values, collect(eltype(values), levels))
)
cannot deal with conversion from NA to a string type, so it all comes down to convert
.
Regarding levels = nothing
at the beginning of the function. Why does it equal nothing
? Should not it be taken out from PooledDataArray at DF?
@timholy, a couple of observations and questions:
discretize
) does not use existing PooledDataArray(PDA) in DataFrame(DM). Every time it is needed a new PDA is created. Is it by design or it's a bug?preserve_order
arg in discretize
has got unknown goal. What is it for?Scale.x_discrete(levels=...)
): plot(reddit, x="age_range", Geom.histogram, Scale.x_discrete(levels=levels(reddit[:age_range])))
nastring
arg does not work as well: reddit = readtable(fname, makefactors=true,nastrings=["NA", "na", "n/a", "missing"])
Suggestion:
To cope with NA problem for PooledDataArrays it is enough to add simple condition to check type of values
in discretize
and then choose one of the following path:
a. da=values
(make use of existing PDA)
b. levels = DataArrays.levels(values); da = discretize_make_pda(values, collect(eltype(values), levels))
(every Gadfly's PDA is its own, there is no sharing PDA with DF)
Which one is in accordance with the programming principles for scale.jl?
I could prepare a Pull Request but I feel there is something more. There is a hidden floor in this issue.
Since Gadfly is using DataFrames and DataArrays I was surprised that it seems not to handle
NA
In code where I currently rely on plotting
NA
values with Gadfly I definebut that is a stopgap at best and it leads to #566