[BUG] Scatter plots silently plot wrong data when NaNs are present

niclasmattsson commented 3 years ago

Details

When you plot an XY scatter plot with additional dimensions of data in markers or colors, any NaNs in the XY data will shift the markers and colors and cause Plots to display incorrect data.

I hope this gets immediate attention because I think this is a super serious problem. I very nearly submitted a paper with completely messed up results because of this.

Demo (correct): note that the third circle is large and green.

julia> scatter([1,2,3], [3,2,1], markersize=[30,10,30], c=[1,2,3], xlims=(0,4), ylims=(0,4), legend=false)

newplot (2)

Now change a coordinate for circle 2 to NaN and note what happens with circle 3.

julia> scatter([1,2,3], [3,NaN,1], markersize=[30,10,30], c=[1,2,3], xlims=(0,4), ylims=(0,4), legend=false)

newplot (3)

Backends

This bug occurs on ( insert x below )

Backend	yes	untested
gr (default)	X
pyplot		X
plotly	X
plotlyjs		X
pgfplotsx		X
inspectdr		X

Versions

Plots.jl version: 1.10.2 Backend version (]st -m): Output of versioninfo():

julia> versioninfo()
Julia Version 1.6.0-beta1.0
Commit b84990e1ac (2021-01-08 12:42 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, broadwell)
Environment:
  JULIA_EDITOR = "C:\Program Files\Microsoft VS Code\Code.exe"
  JULIA_NUM_THREADS = 4
  JULIA_PKG_DEVDIR = C:/Stuff

daschw commented 3 years ago

I can see how this is unexpected and that is really unfortunate. However, I am afraid that this is the expected behavior, because NaNs are used in Plots to seperate series segments as illustrated in reference image 49. Changing this would also break the behavior described in http://docs.juliaplots.org/latest/input_data/#Unconnected-Data-within-same-groups.

niclasmattsson commented 3 years ago

Yes, I've used that segment functionality and found it very convenient. But now that this bug/feature bit me really hard I think something needs to change. I had no lines in my figure so segment breaks weren't even on the radar in my case. Further, NaNs naturally arise in scientific computations and since they propagate correctly without errors (and plot invisibly as expected), many times people don't bother detecting and handling them. Having Plots silently shift other dimensions when NaNs are present is just too dangerous behavior for the main visualization tool in a computations-oriented language. So let me begin brainstorming how to resolve this:

Could Plots check the lengths of all the data dimensions and throw an error or warning when they don't match? In my example above there are unused elements in the marker and color vectors.
Similarly, maybe Plots could require that when NaNs are used as segment breaks, there must be NaNs in that position in all data dimensions?
Maybe Plots could come up with some other marker to indicate segment breaks?

RodolfoFigueroa commented 3 years ago

I have to agree. The segment functionality is very handy, but having Plots shift parameter arrays without so much as a warning is extremely dangerous. I'm currently combing through my old code to see if this affected any of my plots, because I had no idea this "feature" existed until now.

mkborregaard commented 3 years ago

I think @niclasmattsson first suggestion:

Could Plots check the lengths of all the data dimensions and throw an error or warning when they don't match? In my example above there are unused elements in the marker and color vectors.

sounds like a good solution.

daschw commented 3 years ago

If, I'm not mistaken, then we would not be able to allow automatic cycling anymore, like in

scatter(rand(6), color=[:red, :blue, :green], marker=[:square, :circle])

I'm not sure how widely these things are used and maybe it would be better to be a little bit more restrictive regarding the input in Plots and I am open to discuss this, but a change like this would be really breaking and could only happen in Plots 2.0

niclasmattsson commented 3 years ago

How about semiautomatic cycling then? :)

scatter(rand(6), color=[:red, :blue, :green], marker=[:square, :circle], cycle=true)

That keyword argument flag suggests a fourth idea for my brainstorm list: maybe interpreting NaNs as segment breaks must be enabled by specifically adding nanbreaks=true?

I understand that any breaking change has to wait for the next major release. In any case, thanks for listening and taking this seriously.

mkborregaard commented 3 years ago

Relevant issues https://github.com/JuliaPlots/Plots.jl/issues/2980 https://github.com/JuliaPlots/Plots.jl/issues/1325 https://github.com/JuliaPlots/Plots.jl/issues/1151 (mysteriously closed)

yha commented 3 years ago

I've just found this issue report, after writing a fix (I'll send the PR soon, after adding some tests). I've been bitten by this several times recently. I think the current behavior is bad enough to warrant fixing at the price of a breaking change. NaNs can appear in data "at random" so it should not affect how indexes in data correspond to indexes in attributes. It's seems to more much more likely to silently produce wrong plots than to be used intentionally. Also, this behavior seems to be somewhat new: I see that it was introduced in #2940 (31 August 2020), and before that there was no consistency among backends. According to the "before" example there, the pyplot backend was already doing the right thing (at least for the markershape and color attributes) before that PR. So I think this should be considered a regression to be fixed rather than intended behavior.

JuliaPlots / Plots.jl