Scatter plot with density cross-diagonal artifact

DHI / modelskill

Compare results from MIKE and other simulations with measurements

https://dhi.github.io/modelskill

MIT License


Scatter plot with density cross-diagonal artifact #278

Closed ecomodeller closed 10 months ago

ecomodeller commented 10 months ago

There seems to be an issue with the 2D density plot used in the scatter plot:

from modelskill.plotting import scatter
import matplotlib.pyplot as plt
import numpy as np

# Strongly correlated synthetic data (correlation 0.98)
np.random.seed(42)
X = np.random.multivariate_normal([0, 0], [[1, 0.98], [0.98, 1]], 20000)
x = X[:, 0]
y = X[:, 1]

fig, ax = plt.subplots(1, 2, figsize=(14, 6))
# Left: default scatter, points colored by density
scatter(x, y, ax=ax[0])
# Right: 2D histogram with 100 bins, no points
scatter(x, y, show_density=False, show_hist=True, show_points=False, bins=100, ax=ax[1])

These bands seem like an artifact.

ecomodeller commented 10 months ago

Same artifact with the plotly backend

ecomodeller commented 10 months ago

The same happens if your variables are negatively correlated (hopefully not the case for most models 😉).

jsmariegaard commented 10 months ago

@daniel-caichac-DHI, can you verify this?

daniel-caichac-DHI commented 10 months ago

Yes, I have seen this artifact. It depends on the number of bins the data is grouped into for the density plot: the smaller the bins, the smaller the artifact. If you set bins=1000 in your example, you should no longer see it. It is the trade-off of binning the data (a 2D histogram) for quick plotting. Alternatively, we could estimate the density with a KDE and use that for the color scale, but with 1e6 points or more that can take an eternity.
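To make the mechanism concrete, here is a minimal sketch of coloring points by a 2D histogram, written with plain NumPy. It illustrates the general technique only and is not modelskill's actual implementation; `binned_density` is a hypothetical helper. Every point inherits the count of the cell it falls in, so with coarse bins large groups of points share a single color value.

```python
import numpy as np

def binned_density(x, y, bins):
    """Return, for each point, the count of the 2D histogram cell it falls in."""
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Look up each point's cell index; clip so points on the outer edges stay in range.
    ix = np.clip(np.searchsorted(xedges, x, side="right") - 1, 0, bins - 1)
    iy = np.clip(np.searchsorted(yedges, y, side="right") - 1, 0, bins - 1)
    return counts[ix, iy]

# Same kind of synthetic data as in the report above.
rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 0], [[1, 0.98], [0.98, 1]], 20000)
x, y = X[:, 0], X[:, 1]

coarse = binned_density(x, y, bins=20)   # few, large cells
fine = binned_density(x, y, bins=100)    # many, small cells

print("max points per cell, bins=20: ", int(coarse.max()))
print("max points per cell, bins=100:", int(fine.max()))
```

Passing `coarse` or `fine` as the color array to a scatter call would show the difference: with 20 bins the color changes in square steps that cut across the narrow diagonal cloud.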

jsmariegaard commented 10 months ago

Could we use overlapping bins to overcome this artifact (cheap alternative to rolling bins)?
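One possible reading of "overlapping bins" is to average the per-point counts over several grids that are shifted by a fraction of a cell (an averaged shifted histogram). The sketch below is only an illustration of that idea under this assumption; `shifted_hist_density` and its parameters are hypothetical and not part of modelskill.

```python
import numpy as np

def shifted_hist_density(x, y, bins=20, n_shifts=4):
    """Average each point's cell count over several shifted square grids."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    width = (hi - lo) / bins
    total = np.zeros(x.shape)
    for k in range(n_shifts):
        # Move the grid origin back by a fraction of one cell width;
        # one extra cell keeps the whole data range covered.
        offset = lo - k * width / n_shifts
        edges = offset + width * np.arange(bins + 2)
        counts, _, _ = np.histogram2d(x, y, bins=[edges, edges])
        ix = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins)
        iy = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, bins)
        total += counts[ix, iy]
    return total / n_shifts

rng = np.random.default_rng(42)
X = rng.multivariate_normal([0, 0], [[1, 0.98], [0.98, 1]], 20000)
x, y = X[:, 0], X[:, 1]

smoothed = shifted_hist_density(x, y, bins=20, n_shifts=4)
print(smoothed[:5])
```

The cost grows linearly with `n_shifts`, so this stays much cheaper than a full KDE while softening the hard cell boundaries.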

daniel-caichac-DHI commented 10 months ago

OK, I just had some time to look at this. I replicated @ecomodeller's code, but I think the solution is far simpler.

Top figures: 100 bins, both points and histogram. Bottom figures: default (20 bins), both points and histogram.

The solution, as I see it, is as simple as increasing the extremely low default, currently bins=20, to something like bins=100 or bins=200.

The scatter plot now follows the histogram. With the current default of bins=20, we are clustering water level data in chunks of ~0.5 m by ~0.5 m, so of course it will look horrid.
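The cell size quoted above is just the data range divided by the number of bins. As a quick check, assuming for illustration that the water levels span about 10 m (the thread does not state the actual range), the arithmetic looks like this; `bin_width` is a hypothetical helper.

```python
import numpy as np

def bin_width(values, bins):
    """Width of one bin when `bins` equal-width bins span the data range."""
    values = np.asarray(values, dtype=float)
    return (values.max() - values.min()) / bins

# Hypothetical water levels spanning 10 m, for illustration only.
water_level = np.linspace(-5.0, 5.0, 1001)

for bins in (20, 100, 200):
    print(f"bins={bins:>3}: {bin_width(water_level, bins):.3f} m per bin")
```

Under that assumption, 20 bins give 0.5 m cells, while 100 and 200 bins give 0.1 m and 0.05 m cells.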

The comparison that Jan did before was between a histogram with 100 bins and a scatter plot whose point colors come from a histogram with just 20 bins, so it is not a fair comparison.

daniel-caichac-DHI commented 10 months ago

Sent this PR

https://github.com/DHI/modelskill/pull/282

jsmariegaard commented 10 months ago

Closed by #282