ResidentMario / missingno

Missing data visualization module for Python.
MIT License

Displaying data labels in Y axis on the left (instead of 1 and number of rows) #36

Open gurol opened 7 years ago

gurol commented 7 years ago

Could we write the data labels on the Y axis, as is done for time-series data? (As in the given example, msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ'), but for text.)

DataLabels  DS2       DS0       DS1       DS3    DS5
LABEL_1     0.001132  NaN       0.011811  0.002  0.000712
LABEL_2     0.013395  0.012160  0.007874  0.007  0.005013

ResidentMario commented 7 years ago

Can you describe this a bit more? I'm not sure what you mean.

gurol commented 7 years ago

[Image: missingnoexample]

The image above is an example showing all the nos (data labels).

Even better, there could be the following display options,

where the labels are defined in the index of the source data frame. Parametric text colors could be used to distinguish missing and existing labels. It is not necessary to position the labels at the exact point in the graph; just list the labels down the left side from the top, adjusting the font size to fit. Thank you for your interest.

ResidentMario commented 7 years ago

Looking at this again, I don't think this is possible. The problem is that to get a useful sample of your dataset you need to include at least 100 or so records, which would mean 100 or so labels, which would be tiny. You wouldn't be able to read them at all!

It should be possible (with text label collision detection, which IIRC exists somewhere) to label some subset of the offending data in the display. However, that requires user input explaining, somehow, what the "anomaly threshold" is. And that starts to look too complicated to me for such a basic chart!

I'll nevertheless leave this feature request open, for now.

jason-r-becker commented 6 years ago

I think this could still be a useful feature for time-series data. It could function similarly to autofmt_xdate() from matplotlib (https://matplotlib.org/_modules/matplotlib/figure.html#Figure.autofmt_xdate). Having a few dates shown would help visualize the times associated with missing data.
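A minimal sketch of what "a few dates" could look like, assuming the matrix is drawn on a matplotlib Axes with row i at y = i (the usual image-plot convention) and that the frame has a datetime-like index. The helper names sparse_row_ticks and label_rows_with_dates are hypothetical, not part of missingno:

```python
def sparse_row_ticks(n_rows, max_labels=8):
    """Pick at most `max_labels` evenly spaced row positions.

    The first and last rows are always included when there is room for them.
    """
    if n_rows <= 0:
        return []
    if n_rows <= max_labels:
        return list(range(n_rows))
    if max_labels < 2:
        return [0]
    step = (n_rows - 1) / (max_labels - 1)
    # A set drops duplicates that rounding could produce.
    return sorted({round(i * step) for i in range(max_labels)})


def label_rows_with_dates(ax, index, max_labels=8, date_format="%Y-%m"):
    """Label a handful of rows on the Y axis with formatted dates.

    Assumption: `ax` holds a matrix plot where row i sits at y = i, and
    every element of `index` has a strftime() method.
    """
    positions = sparse_row_ticks(len(index), max_labels)
    ax.set_yticks(positions)
    ax.set_yticklabels([index[p].strftime(date_format) for p in positions])


# Possible usage (not run here; requires missingno and matplotlib):
#   fig, ax = plt.subplots(figsize=(20, 14))
#   msno.matrix(df, ax=ax, sparkline=False)
#   label_rows_with_dates(ax, df.index, max_labels=6)
```

Capping the number of labels sidesteps the readability problem raised earlier in the thread, at the cost of leaving most rows unlabeled.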

remisphere commented 5 years ago

This would also be useful when working with Pandas' MultiIndex, with the option to choose a particular level that encompasses several samples.

For example, I am working with a dataset consisting of stereo video frames recorded from a car, and have sorted them in a dataframe with the following row multi-index: environment / recording session / stereo side / timestamp. While displaying every timestamp would be, as you said, impossible (additionally because timestamps are not related from one recording session to another), printing only the much sparser environment or recording session labels would make it easier to localise where data is missing (provided that the dataframe is sorted).

From what I have seen when trying to use a multi-index on the column axis, missingno just reads it as a tuple.
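As a rough illustration of the idea, one tick can be placed at the first row of each run of equal labels in the chosen level. This is only a sketch under assumptions: the frame is sorted, the matrix is drawn with row i at y = i, and the helper names run_start_ticks and label_rows_by_level are hypothetical rather than anything missingno provides:

```python
def run_start_ticks(labels):
    """Return (positions, names) for the first row of each run of equal labels."""
    positions, names = [], []
    for row, label in enumerate(labels):
        # A new run starts at the first row or whenever the label changes.
        if not names or label != names[-1]:
            positions.append(row)
            names.append(label)
    return positions, names


def label_rows_by_level(ax, df, level):
    """Label the Y axis with one tick per contiguous group of an index level.

    Assumption: `df` is sorted so that equal values of `level` are adjacent,
    and `ax` holds a matrix plot where row i sits at y = i.
    """
    labels = df.index.get_level_values(level)
    positions, names = run_start_ticks(labels)
    ax.set_yticks(positions)
    ax.set_yticklabels([str(name) for name in names])


# Possible usage (not run here; requires missingno and matplotlib):
#   fig, ax = plt.subplots(figsize=(20, 14))
#   msno.matrix(df, ax=ax, sparkline=False)
#   label_rows_by_level(ax, df, level="environment")
```

Because ticks follow the actual group boundaries, groups of unequal size are labeled at the right rows; if the frame is not sorted, the same label simply appears once per run.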

arturomoncadatorres commented 4 years ago

@gurol I solved this with a few extra lines of code after calling msno.matrix. In my df, I had a column called year and I wanted to see if there were some years that had missing values. Therefore, my code looked like this:

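import matplotlib.pyplot as plt
import missingno as msno
import numpy as np

# Sort rows so that each year forms one contiguous block.
df = df.sort_values(by=['year'])

fontsize = 20

fig, ax = plt.subplots(1, 1, figsize=[20, 14])
msno.matrix(df=df, ax=ax, color=(0.2, 0.2, 0.2), sparkline=False, fontsize=fontsize)

# Place one tick per year, evenly spaced from the top of the matrix.
# Note: even spacing assumes every year has roughly the same number of rows.
years = list(df['year'].unique())
ylim_start, ylim_end = ax.get_ylim()
step_size = df.shape[0] / len(years)
_ = ax.yaxis.set_ticks(np.arange(ylim_end, ylim_start, step_size))
_ = ax.yaxis.set_ticklabels(years, fontsize=fontsize)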

@ResidentMario would this be a feature that you would be interested in adding to missingno? If so, we could further discuss the implementation, and I could take the lead in making a PR.