astrofrog / fast-histogram

:zap: Fast 1D and 2D histogram functions in Python :zap:
BSD 2-Clause "Simplified" License

histogram result has strange spikes as compared to numpy histogram #27

Closed sahaskn closed 6 years ago

sahaskn commented 6 years ago

Data file : https://ufile.io/cnj9l

Python Code:

import numpy as np
import matplotlib.pyplot as plt
from fast_histogram import histogram1d

data = np.load('x.npz')
h_np, _ = np.histogram(data['x'], bins=1100, range=[0, 1100])
h_fast = histogram1d(data['x'], bins=1100, range=[0, 1100])

plt.plot(h_np[:-1], 'r--', label='numpy')
plt.plot(h_fast[:-1], label='fast')
plt.legend()

The result looks like this: [image]

The spikes also appear with histogram2d.

astrofrog commented 6 years ago

@sahaskn - the issue is that the values you are reading in are integer values (well, they are float32, but they are round values such as 1.0). I think this is causing some non-deterministic behavior in cases where the bin edges line up exactly with the values. For instance, if you had values of 0.0, 1.0, and 2.0, and the histogram went from 0 to 2 with two bins, it's not clear which bin the value 1.0 should fall in. Could you try changing the number of bins to see if it's just an issue when bins is 1100? I can investigate how NumPy treats these edge cases to see if we can be more consistent with it.
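
For reference, the three-value example above can be run through NumPy directly to see where it places values that sit exactly on a bin edge. This is only a small sketch of NumPy's behavior; it says nothing about how fast_histogram assigns the same values.

```python
import numpy as np

# Three values that sit exactly on the bin edges 0, 1 and 2.
values = np.array([0.0, 1.0, 2.0], dtype=np.float32)

counts, edges = np.histogram(values, bins=2, range=(0, 2))

print(edges)   # the three bin edges: 0, 1, 2
print(counts)  # 1.0 is counted in the second bin; 2.0 stays in the closed last bin
```

NumPy counts 1.0 in the second bin because every bin includes its lower edge, and it keeps 2.0 because the last bin also includes its upper edge.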

astrofrog commented 6 years ago

Just to add to my comment above, given the values you have, you can avoid non-deterministic effects by choosing range=[-0.5, 1100.5] and bins=1101 so that the values fall at the center of the bins.
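
A quick way to check the centered-bin suggestion is to compare it against a direct integer count, which involves no bin edges at all. The sketch below uses NumPy only, and the random array is a hypothetical stand-in for the data file, assumed to hold whole numbers from 0 to 1100 stored as float32.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the data file: whole numbers 0..1100 as float32.
x = rng.integers(0, 1101, size=10_000).astype(np.float32)

# Bin width is (1100.5 - (-0.5)) / 1101 = 1, so bin i spans [i - 0.5, i + 0.5)
# and every whole-number value sits at the center of its bin.
counts, edges = np.histogram(x, bins=1101, range=(-0.5, 1100.5))

# Counting the integers directly gives a reference that has no edge ambiguity.
reference = np.bincount(x.astype(np.int64), minlength=1101)
matches = np.array_equal(counts, reference)
print(matches)
```

Because no value lies on an edge, any implementation that bins by position should agree with the reference here, whatever its edge convention.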

sahaskn commented 6 years ago

@astrofrog, thanks for the reply. The NumPy documentation says that each bin includes its lower edge and excludes its upper edge, except for the last bin, which includes both edges. I assumed fast_histogram works the same way, apart from the last bin, where the upper edge is not included.
Thanks for suggesting range=[-0.5, 1100.5] with bins=1101, which worked.
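
The two edge conventions mentioned in this comment can be written out as a small pure-Python sketch. The function `bin_index` and its `closed_last` flag are hypothetical names made up for illustration; this is not the implementation used by NumPy or by fast_histogram.

```python
def bin_index(x, xmin, xmax, nbins, closed_last=True):
    """Return the bin index for x, or None if x is not counted.

    Hypothetical helper: every bin includes its lower edge and excludes its
    upper edge. With closed_last=True the last bin also includes xmax.
    """
    if x < xmin or x > xmax:
        return None
    if x == xmax:
        return nbins - 1 if closed_last else None
    idx = int((x - xmin) / (xmax - xmin) * nbins)
    # Guard against rounding pushing a value just below xmax past the last bin.
    return min(idx, nbins - 1)


values = [0.0, 1.0, 2.0]
closed = [bin_index(v, 0.0, 2.0, 2) for v in values]
half_open = [bin_index(v, 0.0, 2.0, 2, closed_last=False) for v in values]
print(closed)     # last bin includes the upper edge
print(half_open)  # upper edge of the range is dropped
```

The only difference between the two conventions is whether a value equal to the upper end of the range is counted, which is why centering whole-number data inside the bins sidesteps the question.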

Histogram2d is really fast!!!