MakieOrg / AlgebraOfGraphics.jl

An algebraic spin on grammar-of-graphics data visualization in Julia. Powered by the Makie.jl plotting ecosystem.
https://aog.makie.org
MIT License

Ignore missing values #488

Closed pdeffebach closed 2 months ago

pdeffebach commented 9 months ago

It would be really nice if AlgebraOfGraphics.jl ignored missing values better.

Current behavior is unintuitive and likely not desired by anyone: it treats the entire column as categorical.

The desired behavior would be to drop pairs (x, y) where either x or y is missing, similar to how AlgebraOfGraphics treats NaN.

I'm happy to have a larger discussion about the semantics of missing for various edge cases that come up, but I deal with lots of missing data all the time and the current behavior makes it hard to use missing and iterate quickly to make plots.
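
The pairwise dropping requested above can be sketched in Base Julia. This is only an illustration: the helper name `drop_missing_pairs` and the sample vectors are hypothetical and are not part of AlgebraOfGraphics.jl.

```julia
# Hypothetical helper: keep only the (x, y) pairs where neither value is missing.
function drop_missing_pairs(x::AbstractVector, y::AbstractVector)
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    keep = .!(ismissing.(x) .| ismissing.(y))
    # No missings remain after filtering, so skipmissing narrows the element type.
    return collect(skipmissing(x[keep])), collect(skipmissing(y[keep]))
end

x = [1.0, 2.0, missing, 4.0]
y = [10.0, missing, 30.0, 40.0]
x_clean, y_clean = drop_missing_pairs(x, y)
```

Only the first and last observations survive here, because each of the middle two has a missing coordinate.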

jariji commented 8 months ago

In Julia missing doesn't generally get dropped implicitly because the presence of a missing can be important information.

julia> sum([1,2, missing])
missing

As with sum, I prefer plots to reflect the existence of missings in the data. Otherwise I am at risk of massively misinterpreting my data. Having a simple way of skipping missing values could be nice, but I would like it to be opt-in.
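
For reference, Base Julia already has an opt-in form of skipping for reductions like sum. A minimal sketch, where the vector `xs` is just sample data:

```julia
xs = [1, 2, missing]

total_strict = sum(xs)                # the unknown value propagates
total_skipped = sum(skipmissing(xs))  # missings are skipped only on request

ismissing(total_strict)  # true
total_skipped            # 3
```

An opt-in switch for plotting would presumably play the same role that `skipmissing` plays here.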

pdeffebach commented 8 months ago

@jariji

This isn't quite right. Even though sum([1, 2, missing]) returns missing in Julia, existing libraries don't treat missing strictly. Both Plots.jl and Makie.jl omit missing values in arguments before plotting (similar to how GLM omits missing before running a regression).

Going further, I conducted a survey of Julia, R, Python, and Octave (because I don't have a Matlab installation) to better understand how each library treats missing values. I created a dataset in Julia of ages and wages for individuals. The :wage variable is missing for 10% of the population. For each language I assessed (1) how missing is treated in mean(x) where x contains missing values, and more importantly (2) how missing values are handled during plotting.

As you can see, in every language and plotting library other than AlgebraOfGraphics.jl, missing values are dropped before plotting. So if AlgebraOfGraphics.jl were to require separate handling of missing values, it would break with existing standards and expectations for plotting libraries.

| Language | Plotting package | Missing treatment in mean | Missing treatment in plotting | Notes |
| -- | -- | -- | -- | -- |
| Julia | Plots.jl | Returns missing | Ignores missing | |
| Julia | Makie.jl | Returns missing | Ignores missing | |
| Julia | AlgebraOfGraphics.jl | Returns missing | Converts to categorical | |
| R | ggplot | Returns missing | Ignores missing | |
| R | Base R | Returns missing | Ignores missing | |
| Python | Pandas | Ignores missing | Ignores missing | Using pd.NA |
| Python | Numpy + Matplotlib | Returns missing | Ignores missing | Using nan |
| Octave | Base | Returns missing | Ignores missing | Using NA |

Note that this holds even in languages that propagate missing values in mean(df.age), as Julia does. (The only framework that ignores missing values in mean is pandas.)

Also note that in all the examples I show, I use the closest possible value to missing. In R I use NA, in Pandas I use pd.NA, and in Octave I use NA. The one framework I tested that does not include NA is Numpy, which only supports NaN.

Turning back to the earlier fact that missing is omitted in Makie.jl, my guess is that current behavior could probably be considered a bug, and is the result of an unnecessary <:Real dispatch somewhere in AlgebraOfGraphics.jl. However, I can't find it at the moment. @SimonDanisch, do you know where AlgebraOfGraphics.jl might be treating a Union{Missing, <:Real} vector as categorical?
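
To illustrate the kind of dispatch being guessed at, here is a minimal sketch. The two functions are hypothetical and are not taken from the AlgebraOfGraphics.jl source; they only show how a `<:Real` constraint on the element type excludes a vector whose element type is `Union{Missing, Float64}`.

```julia
# Hypothetical classifiers, NOT the actual AlgebraOfGraphics.jl implementation.

# Strict version: only vectors whose element type is a subtype of Real count.
iscontinuous_strict(v::AbstractVector{<:Real}) = true
iscontinuous_strict(v::AbstractVector) = false

# Lenient version: element types that also allow missing are accepted.
iscontinuous_lenient(v::AbstractVector{<:Union{Missing, Real}}) = true
iscontinuous_lenient(v::AbstractVector) = false

wage = [12.5, missing, 9.0]       # Vector{Union{Missing, Float64}}

strict = iscontinuous_strict(wage)    # false: falls through to the generic method
lenient = iscontinuous_lenient(wage)  # true
```

Under this assumption, a single strict method of this shape would be enough to send a numeric column with missing values down the categorical path.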

Below are my implementations:

Julia data generation

```julia
using CSV, DataFrames

N = 1000
γ = .05
df = DataFrame(age = rand(20:65, N))
df.wage = map(df.age) do a
    w = a * γ + rand() * 10
    rand() < .1 ? missing : w
end
CSV.write("data/wages.csv", df)
```

Julia plotting

```julia
using GLMakie, Makie, AlgebraOfGraphics
using CSV, DataFrames
using Statistics  # provides mean
using Plots: Plots

df = CSV.read("data/wages.csv", DataFrame)

# Makie.jl (GLMakie.jl) ##############################################
p = GLMakie.plot(df.age, df.wage)
save("out/julia/makie.png", p)

# Plots.jl ###########################################################
p = Plots.scatter(df.age, df.wage)
Plots.savefig(p, "out/julia/plots.png")  # Plots.jl figures are written with savefig

# AlgebraOfGraphics.jl ###############################################
# This one is very messed up
p = data(df) * mapping(:age, :wage) |> draw
save("out/julia/algebraofgraphics.png", p)

# Mean behavior ######################################################
m = mean(df.wage) # missing
```

R plotting

```r
library(tidyverse)
df = read_csv("data/wages.csv")

# Base R #############################################################
png("out/R/baser.png")
plot(df$age, df$wage)
dev.off()

# ggplot #############################################################
p = df |>
    ggplot(aes(x = age, y = wage)) +
    geom_point()
ggsave("out/R/ggplot.png", p)

# Mean behavior ######################################################
mean(df$wage) # NA
```

Octave plotting

```matlab
df = csvread('data/wages.csv', "emptyvalue", NA)

# Plotting ###########################################################
p = scatter(df(:, 1), df(:, 2))
saveas(p, "out/octave/octave.png", "png")

# Mean ###############################################################
m = mean(df(:, 2)) # NA
```

Python plotting

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("data/wages.csv")
df = df.fillna(pd.NA)

# Pandas graphing ####################################################
ax = df.plot.scatter(x='age', y='wage')
ax.figure.savefig('out/python/pandas.png')

# Matplotlib graphing ################################################
age = np.array(df.loc[:, "age"])
wage = np.array(df.loc[:, "wage"])
p = plt.scatter(age, wage)
p.figure.savefig('out/python/pyplot.png')

# Pandas mean ########################################################
m = df.loc[:, "wage"].mean() # A real

# Numpy mean #########################################################
x = np.array(df.loc[:, "wage"])
x.mean() # nan
```

jariji commented 8 months ago

ggplot2 does make an NA bar

> data.frame(a = c('one', 'two', 'two', NA)) %>% 
  ggplot(aes(a)) +
  geom_bar()

But I don't think prior art is the way to go here. A plot is a data summary, just like mean is, and there isn't really a principled justification for removing unknown values for plotting specifically. The consistency argument favors retaining them; the argument for dropping data is about pragmatics. I think we should find a simple way to satisfy users who want missing data hidden, but that shouldn't compromise the expectations of users who don't.

pdeffebach commented 8 months ago

That doesn't address my point. Your example is about how to handle missing when it is part of a categorical vector. I am talking about how to make a plot where I want a column to be treated as a continuous vector, but it contains missing.

aplavin commented 8 months ago

Matplotlib definitely doesn't just "ignore" nans. Compare how the 2nd and 3rd lines here look in the plot:

import matplotlib.pyplot as plt
import numpy as np

plt.plot([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
plt.plot([1, 2, 4, 5], np.array([2, 1, 5, 4]) + 1)  # drop obs #3
plt.plot([1, 2, np.nan, 4, 5], np.array([2, 1, 3, 5, 4]) + 2)  # replace obs #3 with nan

Also, even if/when it does ignore nans, this behavior can be used as a consistency argument for handling NaN in Julia, not missing. And indeed, for numeric values, NaN in Julia is the most well-specified, widely propagated, type-stable, performant, interoperable object to put when the actual value is not available.

TBH, I'm not that worried about this specific case of a plotting library – just concerned about the general direction towards silently ignoring a part of the data without the user explicitly asking for it or knowing about it.

In plots, the best default would likely be to show missing values separately whenever possible. Eg, as a separate bin in the histogram, as a small inset in a scatterplot, etc. This would be a nice selling point of Julia and its plotting ecosystem, like

Julia: language where data handling libraries have safe defaults 🌈

pdeffebach commented 8 months ago

@aplavin I would still say "ignore" is the correct word. Whether there is a gap in the line or it is filled in is not super material to me. The point is that the graph "works", the code runs, and the lines are still treated as continuous variables.

Please look at how AlgebraOfGraphics.jl handles this scenario. The current behavior is very clearly a bug and could not reasonably be the intention of anyone making a plot.

> In plots, the best default would likely be to show missing values separately whenever possible. Eg, as a separate bin in the histogram, as a small inset in a scatterplot, etc. This would be a nice selling point of Julia and its plotting ecosystem, like

This solution would not make my life any better than the current behavior, as I would have to turn off these extras any time I wanted to make a plot.

I will work on a PR to change current behavior to match other plotting libraries.

aplavin commented 8 months ago

> The point is that the graph "works", the code runs

That attitude easily leads to silent correctness issues. Agree with @jariji, developing convenient but explicit and opt-in ways to handle missings is a much better solution. Luckily, lots of Julia functions return nothing to indicate "no value", which is not susceptible to such issues. Still, would be nice to have an alternative (such as missing) that always propagates whenever possible – but still is never silently ignored.

pdeffebach commented 8 months ago

I want to re-emphasize that I explored the behavior of R, Python, Octave and other Julia libraries and all of them seamlessly create graphs with missing values. AlgebraOfGraphics.jl is the odd one out.

@SimonDanisch hopefully you can help with a PR when I open one.

@aplavin @jariji I will not be continuing this conversation, as it is not useful to go back and forth.

laikq commented 8 months ago

Hey, I'm commenting because I need a simple way to tell AlgebraOfGraphics that my data is indeed continuous and not categorical, even though it contains missing values. I agree that silently ignoring missing values could lead to misleading graphics, but I cannot imagine a case where treating Union{Missing, Float64} as categorical is actually what someone needs.

Maybe introducing a counterpart to nonnumeric would help? For example, mapping(:a, :b => continuous) could mark column :b as explicitly continuous.

As a workaround for people in my situation, mapping(:a, :b => (x -> coalesce(x, NaN))) works.
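
The effect of that workaround can be checked without any plotting package. A minimal sketch, where the vector `wage` is sample data and `to_nan` is just a name for the anonymous function used in the mapping:

```julia
wage = [12.5, missing, 9.0]      # Vector{Union{Missing, Float64}}

# The same transformation as in the mapping above, applied element by element.
to_nan = x -> coalesce(x, NaN)
wage_nan = to_nan.(wage)         # missing entries become NaN

eltype(wage_nan)                 # Float64, no longer a Union with Missing
```

After this conversion the column is purely numeric, so the distinction between "missing" and "not a number" is lost; that trade-off is the one debated earlier in this thread.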

jkrumbiegel commented 2 months ago

Missing values are now passed to Makie and do not signal that data is categorical anymore.