arviz-devs / arviz-plots

ArviZ modular plotting
https://arviz-plots.readthedocs.io
Apache License 2.0
2 stars 1 forks source link

[WIP] Histogram support addition to distplot.py #47

Closed imperorrp closed 2 weeks ago

imperorrp commented 4 months ago

This PR aims to add support for histograms in distplot.py (Issue #24 ), allowing marginal densities to be visualized in this new form, apart from preexisting KDE and ECDF forms


📚 Documentation preview 📚: https://arviz-plots--47.org.readthedocs.build/en/47/

imperorrp commented 4 months ago

@OriolAbril I had to dig a bit deeper into understanding how the plotting actually works as coded, but I think I've managed to get step 1 (computing the histogram) roughly done- I'll probably have to combine these DataArrays into a Dataset and then create the visual element for mapping. Also, I'll remove the print statements I've added in later commits- just added them to help with seeing the outputs. Is something like this alright?- Looping through the data variables in distribution to get DataArrays and then passing that plus the dimensions to be reduced to the xarray_einstats.histogram() function.

Also, I was going through the flow of how density was computed for KDE as an example (in Arviz-Stats). I saw bandwidth computing was done for KDE- but that won't be necessary for plotting histograms, right?

imperorrp commented 4 months ago

Combined the xarray_einstats.numba.histogram produced DataArrays into a Dataset:

Dataset-image

(This is with the Schools dataset)

It's in the style of the KDE dataset returned by distribution.azstats.kde to density

imperorrp commented 4 months ago

Changed the plot_axis coords to 'x' and 'y' in place of 'bin_midpoint' and 'bin_height':

image

Also added a visual element function, backend interface and matplotlib implementation with matplotlib.pyplot.bar:

image

The bar widths are different due to the different scales of the x-axis. I'm not sure how this can be fixed yet if keeping the width a consistent value is desired though. Also, the number of bins is set to 20 by default. Some bins have too few values to be visible as their respective bars are too low in height compared to the taller ones.

imperorrp commented 4 months ago

It is looking very good already. The computation of the histogram is a bit trickier and would not be part of the gsoc project, but I think it will be helpful to get you more familiar with xarray.

I have tried to focus on the plotting and contributing workflow with the comments. Let me know if you have any doubts related to testing.

Thank you, I will take a look at your reviews and advice!

imperorrp commented 3 months ago

@OriolAbril Latest commit has changes incorporating advice from your comments. Modified data restructure utility function to keep left and right edges info now until the matplotlib backend calculates midpoints from these at the end to create the bars, and I've updated the docstring and color/facecolor/edgecolor interface

OriolAbril commented 2 months ago

By the way, I commented on the ess and mcse issues but forgot to comment here. There is a histogram function available in arviz-stats now. I think the output will have a similar (if not the same) shape to what you are using

imperorrp commented 2 months ago

By the way, I commented on the ess and mcse issues but forgot to comment here. There is a histogram function available in arviz-stats now. I think the output will have a similar (if not the same) shape to what you are using

Altered the code to make use of this function. Also, it seems like point estimate and credible intervals are being plotted so close to the x axis for histogram plots because the default 'y' aesthetic value/s returned by plot collection are pretty low in relation. The maximum y-axis values for the KDE/ECDF plots are way lower so it's not a problem for them. I initially thought there was some issue with this aesthetic mapping when type hist is picked, but it works fine- it's just really small values compared to the histogram heights. Should we mention this in the documentation, asking users to define custom 'y' aesthetic values (in pc_kwargs)when using kind=histogram for better visibility?

For the plot_dist example where one subplot is created for each variable:

azp.plot_dist(
        data,
        kind="kde",
        pc_kwargs={
            "cols": ["__variable__"],
            "aes": {"color": ["school"], "y": ["school"]},
            "y": np.linspace(0, 0.06, 9),
        },
        aes_map={
            "kde": ["color"],
            "point_estimate": ["color", "y"],
            "credible_interval": ["y"],
        },
    )
    plt.savefig("distplot-kde-cols-matplotlib.png")

image

Point estimates when switching from kind="kde" to kind="hist":

azp.plot_dist(
        data,
        kind="hist",
        pc_kwargs={
            "cols": ["__variable__"],
            "aes": {"color": ["school"], "y": ["school"]},
            "y": np.linspace(0, 0.06, 9),
        },
        aes_map={
            "kde": ["color"],
            "point_estimate": ["color", "y"],
            "credible_interval": ["y"],
        },
    )
    plt.savefig("distplot-hist-cols-matplotlib.png")

image

This is easily fixable with editing the np.linspace line though-

"y": np.linspace(0, 600, 9),

image

We could maybe add this in the documentation so users know to keep this in mind.

imperorrp commented 1 month ago

Just pushed changes per last review

codecov-commenter commented 1 month ago

Codecov Report

Attention: Patch coverage is 57.81250% with 27 lines in your changes missing coverage. Please review.

Project coverage is 84.84%. Comparing base (1298e42) to head (2304364). Report is 1 commits behind head on main.

Files Patch % Lines
src/arviz_plots/plots/utils.py 13.33% 13 Missing :warning:
src/arviz_plots/backend/none/__init__.py 58.33% 5 Missing :warning:
src/arviz_plots/backend/bokeh/__init__.py 55.55% 4 Missing :warning:
src/arviz_plots/backend/matplotlib/__init__.py 63.63% 4 Missing :warning:
src/arviz_plots/backend/__init__.py 50.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #47 +/- ## ========================================== - Coverage 85.85% 84.84% -1.02% ========================================== Files 17 17 Lines 1951 2012 +61 ========================================== + Hits 1675 1707 +32 - Misses 276 305 +29 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

OriolAbril commented 1 month ago

This will need to be rebased and double checked to make it compatible with plotly. Take care of these last two comments left and after that I'll rebase and add more commits to the PR so it works with plotly