Closed pdeffebach closed 2 months ago
In Julia, `missing` doesn't generally get dropped implicitly, because the presence of a `missing` can be important information.
```julia
julia> sum([1, 2, missing])
missing
```
As with `sum`, I prefer plots to reflect the existence of `missing`s in the data. Otherwise I am at risk of massively misinterpreting my data. Having a simple way of skipping missing values could be nice, but I would like it to be opt-in.
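For reference, the opt-in form already exists for reductions in Base Julia. A minimal sketch (the variable names are only illustrative, and nothing here touches a plotting package):

```julia
xs = [1, 2, missing]

# Default: `missing` propagates through the reduction.
total = sum(xs)                        # missing

# Opt-in: the caller explicitly asks for missing values to be skipped.
total_skipped = sum(skipmissing(xs))   # 3
```

The point of the sketch is only that skipping is something the caller writes down, not something that happens silently.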
@jariji
This isn't quite right. Even though `sum([1, 2, missing])` returns `missing` in Julia, existing libraries don't treat `missing` strictly. Both Plots.jl and Makie.jl omit `missing` values in arguments before plotting (similar to how GLM omits `missing` before running a regression).
Going further, I conducted a survey of Julia, R, Python, and Octave (because I don't have a Matlab installation) to better understand how each library treats missing values. I created a dataset in Julia of ages and wages for individuals. The `:wage` variable is missing for 10% of the population. For each language I assessed (1) how `missing` is treated in `mean(x)` where `x` contains `missing` values, and more importantly (2) how `missing` values are handled during plotting.
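A rough sketch of what the Julia side of such a comparison could look like. The data below is a hypothetical stand-in (deterministic, with every 10th wage missing), not the dataset used in the survey:

```julia
using Statistics

# Hypothetical stand-in data: 100 people, every 10th wage is missing (10%).
age  = [20 + i % 40 for i in 1:100]
wage = [i % 10 == 0 ? missing : 15.0 + 0.5 * age[i] for i in 1:100]

n_missing = count(ismissing, wage)        # 10

# (1) `mean` propagates `missing` unless the caller skips it explicitly.
mean_wage         = mean(wage)            # missing
mean_wage_skipped = mean(skipmissing(wage))
```

Part (2), the plotting behavior, depends on the plotting package and is not covered by this sketch.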
As you can see, in all languages and plotting libraries, `missing` values are dropped before plotting. So if AlgebraOfGraphics.jl were to require separate handling of missing values, it would break with existing standards and expectations for plotting libraries.
Note that this is true even in languages where `missing` values are propagated in `mean(df.age)`, as Julia does. (The only framework which ignores missing values in `mean` is pandas.)
Also note that in all the examples I show, I use the closest possible value to `missing`. In R I use `NA`, in pandas I use `pd.NA`, and in Octave I use `NA`. The one framework I test that does not include `NA` is NumPy, which only supports `NaN`.
Turning back to the earlier fact that `missing` is omitted in Makie.jl, my guess is that the current behavior could probably be considered a bug, and is the result of an unnecessary `<:Real` dispatch somewhere in AlgebraOfGraphics.jl. However, I can't find it at the moment. @SimonDanisch, do you know where AlgebraOfGraphics.jl might be treating a `Union{Missing, <:Real}` vector as categorical?
Below are my implementations:
ggplot2 does make an `NA` bar:
```r
library(ggplot2)
library(magrittr)  # provides %>%
data.frame(a = c('one', 'two', 'two', NA)) %>%
  ggplot(aes(a)) +
  geom_bar()
```
But I don't think prior art is the way to go here. A plot is a data summary, just like `mean` is, and there isn't really a principled justification for removing unknown values for plotting specifically. The consistency argument favors retaining them; the argument for dropping data is about pragmatics. I think we should find a simple way to satisfy users who want missing data hidden, but that shouldn't compromise the expectations of users who don't.
That doesn't address my point. Your example is about how to handle `missing` when it is part of a categorical vector. I am talking about how to make a plot where I want a column to be treated as a continuous vector, but it contains `missing`.
Matplotlib definitely doesn't just "ignore" NaNs. Compare how the 2nd and 3rd lines here look in the plot:
```python
import matplotlib.pyplot as plt
import numpy as np

plt.plot([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
plt.plot([1, 2, 4, 5], np.array([2, 1, 5, 4]) + 1)  # drop obs #3
plt.plot([1, 2, np.nan, 4, 5], np.array([2, 1, 3, 5, 4]) + 2)  # replace obs #3 with nan
plt.show()
```
Also, even if/when it does ignore NaNs, this behavior can be used as a consistency argument for handling `NaN` in Julia, not `missing`. And indeed, for numeric values, `NaN` in Julia is the most well-specified, widely propagated, type-stable, performant, interoperable object to put when the actual value is not available.
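To make the type-stability and propagation part of this concrete, here is a small sketch in Base Julia (illustrative variable names only):

```julia
with_nan     = [1.0, NaN, 3.0]
with_missing = [1.0, missing, 3.0]

# `NaN` is itself a Float64, so the element type stays concrete.
T_nan     = eltype(with_nan)       # Float64
# `missing` widens the element type to a Union.
T_missing = eltype(with_missing)   # Union{Missing, Float64}

# Both sentinels propagate through a plain reduction.
s_nan     = sum(with_nan)          # NaN
s_missing = sum(with_missing)      # missing
```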
TBH, I'm not that worried about this specific case of a plotting library; I'm just concerned about the general direction towards silently ignoring a part of the data without the user explicitly asking for or knowing about that.
In plots, the best default would likely be to show missing values separately whenever possible. Eg, as a separate bin in the histogram, as a small inset in a scatterplot, etc. This would be a nice selling point of Julia and its plotting ecosystem, like
Julia: language where data handling libraries have safe defaults 🌈
@aplavin I would still say "ignore" is the correct word. Whether there is a gap in the line or it is filled in is not super material to me. The point is that the graph "works", the code runs, and the lines are still treated as continuous variables.
Please look at how AlgebraOfGraphics.jl handles this scenario. The current behavior is very clearly a bug and could not reasonably be the intention of anyone making a plot.
> In plots, the best default would likely be to show missing values separately whenever possible. Eg, as a separate bin in the histogram, as a small inset in a scatterplot, etc. This would be a nice selling point of Julia and its plotting ecosystem, like
This solution would not make my life any better than the current behavior, as I would have to turn off these extras any time I wanted to make a plot.
I will work on a PR to change current behavior to match other plotting libraries.
> The point is that the graph "works", the code runs
That attitude easily leads to silent correctness issues. I agree with @jariji: developing convenient but explicit and opt-in ways to handle missings is a much better solution.
Luckily, lots of Julia functions return `nothing` to indicate "no value", which is not susceptible to such issues. Still, it would be nice to have an alternative (such as `missing`) that always propagates whenever possible, but still is never silently ignored.
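The contrast between the two sentinels can be sketched with `findfirst`, one of the Base functions that returns `nothing` when there is no result:

```julia
idx = findfirst(==(5), [1, 2, 3])   # nothing: no element equals 5

# `nothing` does not propagate; arithmetic on it throws a MethodError,
# so the "no value" case cannot be silently carried forward.
threw = try
    idx + 1
    false
catch err
    err isa MethodError
end

# `missing` propagates through the same kind of operation instead.
result = missing + 1                # missing
```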
I want to re-emphasize that I explored the behavior of R, Python, Octave and other Julia libraries and all of them seamlessly create graphs with missing values. AlgebraOfGraphics.jl is the odd one out.
@SimonDanisch hopefully you can help with a PR when I open one.
@aplavin @jariji I will not be continuing this conversation, as it is not useful to go back and forth.
Hey, I'm commenting because I need a simple way to tell AlgebraOfGraphics that my data is indeed continuous and not categorical just because it contains `missing` values. I agree that silently ignoring `missing` values could lead to misleading graphics, but I cannot imagine a case where treating `Union{Missing, Float64}` as categorical is actually what someone needs.
Maybe introducing something like `nonnumerical` would help? `mapping(:a, :b => continuous)` could mark column `:b` as explicitly continuous.
As a workaround for people in my situation, `mapping(:a, :b => (x -> coalesce(x, NaN)))` works.
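Outside of any plotting call, the effect of that transformation on a column can be sketched in Base Julia. The name `to_nan` and the sample column are only illustrative:

```julia
# The same anonymous function as in the workaround, given a name here.
to_nan = x -> coalesce(x, NaN)

b = [1.5, missing, 3.0]      # Vector{Union{Missing, Float64}}
b_nan = map(to_nan, b)       # Vector{Float64}, with NaN in place of missing
```

After the replacement the column has a plain `Float64` element type, which presumably is why it is no longer treated as categorical.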
Missing values are now passed to Makie and do not signal that data is categorical anymore.
It would be really nice if AlgebraOfGraphics.jl ignored missing values better.
Current behavior is un-intuitive and likely not desired by anyone: It treats the entire column as categorical.
The desired behavior would be to drop pairs `(x, y)` where either `x` or `y` is `missing`, similar to how AlgebraOfGraphics treats `NaN`.
I'm happy to have a larger discussion about the semantics of `missing` for various edge cases that come up, but I deal with lots of missing data all the time and the current behavior makes it hard to use `missing` and iterate quickly to make plots.
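As a sketch of the pairwise dropping described above, done by hand in Base Julia before any plotting call (this is an illustration of the requested semantics, not how AlgebraOfGraphics.jl implements anything):

```julia
x = [1.0, 2.0, missing, 4.0]
y = [10.0, missing, 30.0, 40.0]

# Keep only the positions where neither coordinate is missing.
keep = .!(ismissing.(x) .| ismissing.(y))

# Narrow the element type now that no missing values remain.
x_kept = convert(Vector{Float64}, x[keep])   # [1.0, 4.0]
y_kept = convert(Vector{Float64}, y[keep])   # [10.0, 40.0]
```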