GiovineItalia / Gadfly.jl

Crafty statistical graphics for Julia.
http://gadflyjl.org/stable/
Other
1.9k stars 250 forks source link

Scale.xdiscrete levels #1130

Open Mattriks opened 6 years ago

Mattriks commented 6 years ago

According to the docs, the levels argument of Scale.x-discrete behaves such that "Order will be respected and anything in the data that's not represented in levels will be set to NA".

andreasnoack commented 6 years ago

Thanks for capturing this. We actually discussed this behavior that results in the error above and we weren't sure what the right solution was. The handling of missing values is often not easy to spot in the code and I wouldn't be surprised if there were other issues with missing values. It probably isn't too hard to get the old behavior back (if we want it) but, as mentioned previously, it would probably be great if the handling of missing values was more centralized if possible.

Mattriks commented 6 years ago

With a dataset containing 7 categories, if I use levels to specify 4 categories e.g. Scale.x_discrete(levels=ux[1:4]) (as above) then the minimum behaviour I would expect is a plot with 4 levels, which is currently not what happens on master.

The NA label is added (by Gadfly) to the 3 levels I didn't specify (its not part of the raw data), which then becomes the first boxplot in the plots above. I'm also undecided whether that should be part of the default behaviour. But as mentioned, the minimum behaviour should be a plot with the 4 levels I specified.

bjarthur commented 6 years ago

perhaps related, and sorry to hijack this issue if not, but the Geom.segment example is throwing the following error on master now:

ERROR: MethodError: no method matching atan2(::Array{Union{Float64, Missings.Missing},1}, ::Array{Union{Float64, Missings.Missing},1})
andreasnoack commented 6 years ago

Let's file a separate issue for the atan2+Missings issue. There might be a few more of these showing up as we battle test the new version.

nalimilan commented 6 years ago

So what would be the best behavior? Show a "missing" category or hide it? I'm not familiar with Gadfly, but it looks weird to group all values which do not appear in the specified levels in a "missing" category, while missing values are not represented by default.

Mattriks commented 6 years ago

I think that showing the "Other" boxplot (i.e. the categories I didn't specify in levels), should be an optional behaviour. There are cases where having an "Other" boxplot could be useful.

But for now, I would like to see the minimum behaviour (as described above) work on master.

Mattriks commented 6 years ago

Another issue I encountered on Gadfly master is a 255 "group" limit, I don't recall hitting a limit with PooledDataArray. This is potentially an issue for polygon-based maps:

using DataFrames, VoronoiDelaunay

srand(123)
tess = DelaunayTessellation()
width = max_coord - min_coord
a= Point.(min_coord+rand(130)*width, min_coord+rand(130)*width) 
push!(tess, a)

# In the next line, change 256 to 257 and the plots fail
triangles = [geta.(tess._trigs) getb.(tess._trigs) getc.(tess._trigs)][2:256,:]
n = size(triangles, 1)
D = DataFrame(x=vec(getx.(triangles)'), y=vec(gety.(triangles)'), id=string.(vec([1:n 1:n 1:n]')))

pa = plot(D, x=:x, y=:y, Geom.polygon(fill=true, preserve_order=true), color=:id, Theme(key_position=:none))
 pb = plot(D, x=:x, y=:y, Geom.polygon(preserve_order=true), group=:id) 
hstack(pa,pb)

delaunay

nalimilan commented 6 years ago

Good catch, this is due to IndirectArray defaulting to UInt8 storage. These lines (and those below) need to be adapted to use UInt32: https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-f5bddbe411e6a7803f1f9fda16c89259R19 https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-18d9bfaa1079b1be1c4db9714ec3a5e3R266 https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-18d9bfaa1079b1be1c4db9714ec3a5e3R431 https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-9ec506bf78232ae17d082c22c2e66449R404 https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-9ec506bf78232ae17d082c22c2e66449R1159

BTW, it would make sense to check that the same storage type is used everywhere, to avoid recompiling lots of methods down the line. For example, this line uses Int and would probably better convert the input to UInt32: https://github.com/GiovineItalia/Gadfly.jl/pull/1090/files#diff-18d9bfaa1079b1be1c4db9714ec3a5e3R292

bjarthur commented 6 years ago

@andreasnoack it'd be nice to tag 0.7 at some point, but this issue needs resolved first. it's not clear to me what the best solution is to the NA problem above, but at least the path is clear for the UInt8 problem. since it's related to your work on IndirectArrays, would you mind taking care of at least the latter? the only suggestion i'd make here is that perhaps UInt16 would suffice. not sure of the performance implications.

tlnagy commented 6 years ago

Bump. @andreasnoack, it would be great to have this fixed!

If there's anything we can do to help, let us know.

andreasnoack commented 6 years ago

Indeed. I think I know how to fix these issues but I need the time and, unfortunately, other tasks take priority right now.

bjarthur commented 6 years ago

we should think about whether we want to get these dataframes related regressions fixed before or after we merge any breaking changes. ideally it'd be nice to have a 0.7.x which completely worked. on the otherhand, no sense delaying progress just because of semantic versioning.

tlnagy commented 6 years ago

Using @Mattriks' code from above, I was able to verify that master fails to plot when using 257 in the code above

ERROR: LoadError: InexactError()
Stacktrace:
 [1] copy!(::IndexLinear, ::Array{UInt8,1}, ::IndexLinear, ::Array{Int64,1}) at ./abstractarray.jl:656
 [2] discretize_make_ia(::Array{String,1}, ::Array{String,1}) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/misc.jl:397
 [3] discretize(::Array{String,1}, ::Array{String,1}, ::Void, ::Bool) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:269
 [4] discretize(::Array{String,1}, ::Array{String,1}) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:258
 [5] apply_scale(::Gadfly.Scale.DiscreteColorScale, ::Array{Gadfly.Aesthetics,1}, ::Gadfly.Data, ::Vararg{Gadfly.Data,N} where N) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:477
 [6] apply_scales(::IterTools.Distinct{Base.ValueIterator{Dict{Symbol,Gadfly.ScaleElement}},Gadfly.ScaleElement}, ::Array{Gadfly.Aesthetics,1}, ::Gadfly.Data, ::Vararg{Gadfly.Data,N} where N) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:34
 [7] apply_scales(::IterTools.Distinct{Base.ValueIterator{Dict{Symbol,Gadfly.ScaleElement}},Gadfly.ScaleElement}, ::Gadfly.Data) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/scale.jl:53
 [8] render_prepare(::Gadfly.Plot) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/Gadfly.jl:677
 [9] render(::Gadfly.Plot) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/Gadfly.jl:757
 [10] collect(::Base.Generator{Tuple{Gadfly.Plot,Gadfly.Plot},Gadfly.##110#111}) at ./array.jl:475
 [11] hstack(::Gadfly.Plot, ::Gadfly.Plot) at /Users/tamasnagy/.julia/v0.6/Gadfly/src/Gadfly.jl:923
 [12] include_from_node1(::String) at ./loading.jl:576
 [13] include(::String) at ./sysimg.jl:14

and that @nalimilan's solution seems to fix that error.

 discretize_make_ia(values::AbstractVector, levels) =
-    IndirectArray(Array{UInt8}(indexin(values, levels)), levels)
+    IndirectArray(Array{UInt}(indexin(values, levels)), levels)

image

But there's still the issue of the tiling breaking at the bottom right of the plot. Not sure what is going on there.

bjarthur commented 6 years ago

@andreasnoack julia 0.7 is just around the corner and it'd be nice to have this issue resolved before dropping support for 0.6. will you have time soon? i could try to take a stab, but you are much more familiar with the changes you've made to support DataFrames. thanks.

andreasnoack commented 6 years ago

I'll try to take a look during July but can't promise anything.

bjarthur commented 6 years ago

@andreasnoack just a friendly bump. i'd like to start working on 0.7, but don't want to drop 0.6 support before this issue is resolved.

bjarthur commented 6 years ago

would be nice to resolve this regression now that gadfly dependencies work on julia 0.7. anyone want to volunteer? :)

nalimilan commented 6 years ago

As I noted above, it shouldn't be hard to fix at least some/most of the issues by using UInt explicitly everywhere IndirectArray is used. It should be trivial for somebody familiar with the Gadfly code.