Open Mattriks opened 6 years ago
Thanks for capturing this. We actually discussed this behavior that results in the error above and we weren't sure what the right solution was. The handling of missing values is often not easy to spot in the code and I wouldn't be surprised if there were other issues with missing values. It probably isn't too hard to get the old behavior back (if we want it) but, as mentioned previously, it would probably be great if the handling of missing values was more centralized if possible.
With a dataset containing 7 categories, if I use levels
to specify 4 categories e.g. Scale.x_discrete(levels=ux[1:4])
(as above) then the minimum behaviour I would expect is a plot with 4 levels, which is currently not what happens on master.
The NA label is added (by Gadfly) to the 3 levels I didn't specify (its not part of the raw data), which then becomes the first boxplot in the plots above. I'm also undecided whether that should be part of the default behaviour. But as mentioned, the minimum behaviour should be a plot with the 4 levels I specified.
perhaps related, and sorry to hijack this issue if not, but the Geom.segment example is throwing the following error on master now:
ERROR: MethodError: no method matching atan2(::Array{Union{Float64, Missings.Missing},1}, ::Array{Union{Float64, Missings.Missing},1})
Let's file a separate issue for the atan2
+Missings
issue. There might be a few more of these showing up as we battle test the new version.
So what would be the best behavior? Show a "missing" category or hide it? I'm not familiar with Gadfly, but it looks weird to group all values which do not appear in the specified levels in a "missing" category, while missing values are not represented by default.
I think that showing the "Other" boxplot (i.e. the categories I didn't specify in levels
), should be an optional behaviour. There are cases where having an "Other" boxplot could be useful.
But for now, I would like to see the minimum behaviour (as described above) work on master.
Another issue I encountered on Gadfly master is a 255 "group" limit, I don't recall hitting a limit with PooledDataArray. This is potentially an issue for polygon-based maps:
using DataFrames, VoronoiDelaunay
srand(123)
tess = DelaunayTessellation()
width = max_coord - min_coord
a= Point.(min_coord+rand(130)*width, min_coord+rand(130)*width)
push!(tess, a)
# In the next line, change 256 to 257 and the plots fail
triangles = [geta.(tess._trigs) getb.(tess._trigs) getc.(tess._trigs)][2:256,:]
n = size(triangles, 1)
D = DataFrame(x=vec(getx.(triangles)'), y=vec(gety.(triangles)'), id=string.(vec([1:n 1:n 1:n]')))
pa = plot(D, x=:x, y=:y, Geom.polygon(fill=true, preserve_order=true), color=:id, Theme(key_position=:none))
pb = plot(D, x=:x, y=:y, Geom.polygon(preserve_order=true), group=:id)
hstack(pa,pb)
Good catch, this is due to IndirectArray
defaulting to UInt8
storage. These lines (and those below) need to be adapted to use UInt32
:
https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-f5bddbe411e6a7803f1f9fda16c89259R19
https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-18d9bfaa1079b1be1c4db9714ec3a5e3R266
https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-18d9bfaa1079b1be1c4db9714ec3a5e3R431
https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-9ec506bf78232ae17d082c22c2e66449R404
https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-9ec506bf78232ae17d082c22c2e66449R1159
BTW, it would make sense to check that the same storage type is used everywhere, to avoid recompiling lots of methods down the line. For example, this line uses Int
and would probably better convert the input to UInt32
:
https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-18d9bfaa1079b1be1c4db9714ec3a5e3R292
@andreasnoack it'd be nice to tag 0.7 at some point, but this issue needs resolved first. it's not clear to me what the best solution is to the NA problem above, but at least the path is clear for the UInt8 problem. since it's related to your work on IndirectArrays, would you mind taking care of at least the latter? the only suggestion i'd make here is that perhaps UInt16 would suffice. not sure of the performance implications.
Bump. @andreasnoack, it would be great to have this fixed!
If there's anything we can do to help, let us know.
Indeed. I think I know how to fix these issues but I need the time and, unfortunately, other tasks take priority right now.
we should think about whether we want to get these dataframes related regressions fixed before or after we merge any breaking changes. ideally it'd be nice to have a 0.7.x which completely worked. on the otherhand, no sense delaying progress just because of semantic versioning.
Using @Mattriks' code from above, I was able to verify that master fails to plot when using 257 in the code above
ERROR: LoadError: InexactError()
Stacktrace:
[1] copy!(::IndexLinear, ::Array{UInt8,1}, ::IndexLinear, ::Array{Int64,1}) at ./abstractarray.jl:656
[2] discretize_make_ia(::Array{String,1}, ::Array{String,1}) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/misc.jl:397
[3] discretize(::Array{String,1}, ::Array{String,1}, ::Void, ::Bool) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:269
[4] discretize(::Array{String,1}, ::Array{String,1}) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:258
[5] apply_scale(::Gadfly.Scale.DiscreteColorScale, ::Array{Gadfly.Aesthetics,1}, ::Gadfly.Data, ::Vararg{Gadfly.Data,N} where N) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:477
[6] apply_scales(::IterTools.Distinct{Base.ValueIterator{Dict{Symbol,Gadfly.ScaleElement}},Gadfly.ScaleElement}, ::Array{Gadfly.Aesthetics,1}, ::Gadfly.Data, ::Vararg{Gadfly.Data,N} where N) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:34
[7] apply_scales(::IterTools.Distinct{Base.ValueIterator{Dict{Symbol,Gadfly.ScaleElement}},Gadfly.ScaleElement}, ::Gadfly.Data) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:53
[8] render_prepare(::Gadfly.Plot) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/Gadfly.jl:677
[9] render(::Gadfly.Plot) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/Gadfly.jl:757
[10] collect(::Base.Generator{Tuple{Gadfly.Plot,Gadfly.Plot},Gadfly.##110#111}) at ./array.jl:475
[11] hstack(::Gadfly.Plot, ::Gadfly.Plot) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/Gadfly.jl:923
[12] include_from_node1(::String) at ./loading.jl:576
[13] include(::String) at ./sysimg.jl:14
and that @nalimilan's solution seems to fix that error.
discretize_make_ia(values::AbstractVector, levels) =
- IndirectArray(Array{UInt8}(indexin(values, levels)), levels)
+ IndirectArray(Array{UInt}(indexin(values, levels)), levels)
But there's still the issue of the tiling breaking at the bottom right of the plot. Not sure what is going on there.
@andreasnoack julia 0.7 is just around the corner and it'd be nice to have this issue resolved before dropping support for 0.6. will you have time soon? i could try to take a stab, but you are much more familiar with the changes you've made to support DataFrames. thanks.
I'll try to take a look during July but can't promise anything.
@andreasnoack just a friendly bump. i'd like to start working on 0.7, but don't want to drop 0.6 support before this issue is resolved.
would be nice to resolve this regression now that gadfly dependencies work on julia 0.7. anyone want to volunteer? :)
As I noted above, it shouldn't be hard to fix at least some/most of the issues by using UInt
explicitly everywhere IndirectArray
is used. It should be trivial for somebody familiar with the Gadfly code.
According to the docs, the
levels
argument ofScale.x-discrete
behaves such that "Order will be respected and anything in the data that's not represented in levels will be set to NA".This is what I get on
Gadfly 0.6.5
:so it works as the docs suggest, but with obvious issues in the top plot, and maybe users would prefer for the NA column not to appear (or to appear only if NA is added to the levels vector)?
And on master (for both plots):