ResidentMario / missingno

Missing data visualization module for Python.
MIT License
3.97k stars 518 forks

Add colour options for removal recommendation/causes #131

Closed JoshuaC3 closed 3 years ago

JoshuaC3 commented 3 years ago

Let's say we have a dataset with an array of X and y variables, all of which contain some level of nullity. How can one go about deciding which columns and rows to drop?

I am putting forward a simple yet effective addition to the missingno package where we not only highlight missing values, but also highlight the rows and columns that should be imputed or dropped, based on some simple logic and thresholds. I have called the extension fillingno (happy to change this) and have added it to a fork of missingno on my GitHub.

How does it work?

import numpy as np
import pandas as pd
import missingno as msno

X = pd.DataFrame(
    {
        'ones': np.ones(50),
        'rand': np.random.normal(size=50),
        'linear': np.linspace(1, 50),
    }
)

y = 3 * np.sin(X.linear) + X.ones + X.rand
y = y.rename('sin_target')

X.loc[4:10, 'ones'] = np.nan
X.loc[4:20, 'rand'] = np.nan
X.loc[40:45, ['linear', 'ones']] = np.nan
y.loc[23:25] = np.nan

Xy = pd.concat([X, y], axis=1)

msno.matrix(Xy)

image
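The thresholds used further down act on the fraction of non-null values per column and per row, so it can help to tabulate those fractions alongside the matrix plot. The following is a minimal sketch in plain pandas, not part of missingno or fillingno; the names `col_completeness` and `row_completeness` and the fixed seed are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: seeded only so the sketch is reproducible

# Rebuild the example frame from above.
X = pd.DataFrame(
    {
        'ones': np.ones(50),
        'rand': np.random.normal(size=50),
        'linear': np.linspace(1, 50),
    }
)
y = (3 * np.sin(X.linear) + X.ones + X.rand).rename('sin_target')
X.loc[4:10, 'ones'] = np.nan
X.loc[4:20, 'rand'] = np.nan
X.loc[40:45, ['linear', 'ones']] = np.nan
y.loc[23:25] = np.nan
Xy = pd.concat([X, y], axis=1)

# Fraction of non-null values per column and per row.
col_completeness = Xy.notna().mean(axis=0)
row_completeness = Xy.notna().mean(axis=1)

print(col_completeness.sort_values())
# How many rows share each completeness level.
print(row_completeness.value_counts().sort_index())
```

A column or row whose fraction falls below a chosen threshold is a candidate for dropping; anything above it is a candidate for filling.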

In this small dataset it is easy to see what is going on and how to treat your data, but in a larger dataset it is likely to be much harder to see which rows overlap and how to treat them. This is where fillingno would be useful:

from missingno import fillingno as flno

flno.matrix(Xy, key_cols=['sin_target'], col_thresh=0.7, row_thresh=0.4)

image

What's going on here?

Firstly, here is the colour coding:

Color: Green - Data: Good.
Color: Blue - Data: Missing-Interpolated.
Color: Amber - Data: Removed-non-Causal. Note: not checked if missing.
Color: Red - Data: Removed-Causal.

That is to say, with the given thresholds and key_cols, we recommend removing everything in amber and red. Red marks the missing values that meet the removal criteria. Amber marks the values that are removed as a consequence of those in red (these are not checked for missingness). We then recommend interpolating or filling the values in blue, as they do not meet the drop/removal criteria. Finally, green is clean data.

Here is more of a breakdown:
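The four colour classes can be sketched as a per-cell label grid. This is an illustration only, not the code in the fork: the function name `classify_cells` and the label strings are hypothetical, and it assumes that every threshold is evaluated on the original frame and that a column or row is dropped when its non-null fraction is below the threshold.

```python
import numpy as np
import pandas as pd

def classify_cells(df, key_cols=(), col_thresh=0.0, row_thresh=0.0):
    """Label every cell; hypothetical helper, not the fillingno API."""
    m = df.isna().to_numpy()                       # True where missing
    is_key_col = df.columns.isin(list(key_cols))   # key column flags

    # Assumption: all criteria are computed on the original frame.
    key_rows = (m & is_key_col).any(axis=1)        # NaN in a key column
    thin_cols = (~m).mean(axis=0) < col_thresh     # too sparse to keep
    thin_rows = (~m).mean(axis=1) < row_thresh

    # A cell is removed if its row or its column is dropped.
    removed = (key_rows | thin_rows)[:, None] | thin_cols[None, :]
    # A removed cell is causal if it is a missing value that triggered a drop.
    causal = m & (is_key_col[None, :] | thin_cols[None, :] | thin_rows[:, None])

    labels = np.full(m.shape, 'good', dtype=object)   # green
    labels[m] = 'missing_interpolated'                # blue
    labels[removed] = 'removed_non_causal'            # amber
    labels[causal] = 'removed_causal'                 # red
    return pd.DataFrame(labels, index=df.index, columns=df.columns)

# Rebuild the example frame (seed added only for reproducibility).
np.random.seed(0)
X = pd.DataFrame({'ones': np.ones(50),
                  'rand': np.random.normal(size=50),
                  'linear': np.linspace(1, 50)})
y = (3 * np.sin(X.linear) + X.ones + X.rand).rename('sin_target')
X.loc[4:10, 'ones'] = np.nan
X.loc[4:20, 'rand'] = np.nan
X.loc[40:45, ['linear', 'ones']] = np.nan
y.loc[23:25] = np.nan
Xy = pd.concat([X, y], axis=1)

labels = classify_cells(Xy, key_cols=['sin_target'],
                        col_thresh=0.7, row_thresh=0.4)
print(labels.stack().value_counts())
```

Under these assumptions the `rand` column and rows 23 to 25 are marked for removal, while the gaps in `ones` and `linear` are marked for interpolation. Evaluating the thresholds sequentially, on whatever survives the previous step, would be an equally plausible design and can give different labels.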

Step 1: Drop sin_target NaNs.

Xy_t = Xy.drop(Xy.index[Xy.sin_target.isna()])
msno.matrix(Xy_t)
# or...
flno.matrix(Xy, key_cols=['sin_target'])

image image

Step 2: Drop columns with threshold.

Xy_t = Xy.dropna(thresh=0.7 * len(Xy), axis=1)
msno.matrix(Xy_t)
# or ...
flno.matrix(Xy, col_thresh=0.7)

image image

Step 3: Drop rows with threshold.

Xy_t = Xy.dropna(thresh=0.4 * Xy.shape[1], axis=0)
msno.matrix(Xy_t)
# or ...
flno.matrix(Xy, row_thresh=0.4)

image image

Step 4: Colour other missing values in blue to indicate interpolation.

Xy_t = Xy.fillna(0)
msno.matrix(Xy_t)
# or ...
flno.matrix(Xy)

image image
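Taken together, the four steps amount to a drop-then-fill pass over the data. The sketch below is one way to apply them in plain pandas. It is not the fork's code: `drop_then_interpolate` is a hypothetical name, the thresholds are assumed to be evaluated on the original frame (as each step above is applied to Xy), and linear interpolation stands in for the `fillna(0)` placeholder used in Step 4.

```python
import numpy as np
import pandas as pd

def drop_then_interpolate(df, key_cols, col_thresh, row_thresh):
    """Apply steps 1 to 4 in one pass; hypothetical helper."""
    present = df.notna()
    # Steps 1 and 3: keep rows with all key columns present and
    # a non-null fraction of at least row_thresh.
    keep_rows = present[key_cols].all(axis=1) & (present.mean(axis=1) >= row_thresh)
    # Step 2: keep columns with a non-null fraction of at least col_thresh.
    keep_cols = present.mean(axis=0) >= col_thresh
    kept = df.loc[keep_rows, keep_cols]
    # Step 4: fill what is left by linear interpolation, in both directions
    # so that leading and trailing gaps are filled too.
    return kept.interpolate(method='linear', limit_direction='both')

# Rebuild the example frame (seed added only for reproducibility).
np.random.seed(0)
X = pd.DataFrame({'ones': np.ones(50),
                  'rand': np.random.normal(size=50),
                  'linear': np.linspace(1, 50)})
y = (3 * np.sin(X.linear) + X.ones + X.rand).rename('sin_target')
X.loc[4:10, 'ones'] = np.nan
X.loc[4:20, 'rand'] = np.nan
X.loc[40:45, ['linear', 'ones']] = np.nan
y.loc[23:25] = np.nan
Xy = pd.concat([X, y], axis=1)

cleaned = drop_then_interpolate(Xy, key_cols=['sin_target'],
                                col_thresh=0.7, row_thresh=0.4)
print(cleaned.shape)
print(cleaned.isna().sum())
```

Linear interpolation is only reasonable here because the rows are ordered and the remaining gaps are short; for other data a different fill strategy may suit better.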

Hopefully it is clear how this relatively simple extension can make your missing data much easier to understand and process.

The code structure might not be perfect, but it works well enough and I believe it is very readable. It is also written so that the functions should be easy to use with plotting packages other than matplotlib (e.g. Seaborn, Plotly-Dash, etc). If you have any suggestions on how to make it better, please say.

I have started adding tests and documentation for these functions.

I intend to extend this functionality to the other plotting types such as bar.

JoshuaC3 commented 3 years ago

https://github.com/JoshuaC3/missingno

ResidentMario commented 3 years ago

To be honest, I don't think most practical usages of missingno benefit from this level of interpretive complexity. Most of the time people want a yes/no to drive further exploration. The imputation then becomes user logic. Looks cool, though!