astrofrog / fast-histogram

:zap: Fast 1D and 2D histogram functions in Python :zap:
BSD 2-Clause "Simplified" License

different result compared to numpy #61

Open d5423197 opened 1 year ago

d5423197 commented 1 year ago

Hello there,

I am trying to use this library as a replacement for NumPy's histogram function, but I get a different result.

I set range to the minimum and maximum of the input, but the result is missing the values equal to the maximum.

For example,

import numpy as np
from fast_histogram import histogram1d

test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])
freq, bins = np.histogram(test_case, bins=range(np.min(test_case), np.max(test_case) + 1))
result = histogram1d(test_case, bins=9, range=(np.min(test_case), np.max(test_case)))

d5423197 commented 1 year ago

Is this repo still maintained?

d5423197 commented 1 year ago

With NumPy's 1D histogram function, if you pass 10 bin edges (as in the code below), the returned hist has length 9. But with fast-histogram's 1D function, if you set bins to 10, the returned hist has length 10, which looks inconsistent.

test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])
freq, bins = np.histogram(test_case, bins=range(np.min(test_case), np.max(test_case + 1)))
test = np.bincount(test_case, minlength=9)
result = histogram1d(test_case, bins=10, range=(np.min(test_case), np.max(test_case)))
result_1 = histogram1d(test_case, bins=9, range=(np.min(test_case), np.max(test_case) + 1))
result_2 = histogram1d(test_case, bins=10, range=(np.min(test_case), np.max(test_case) + 1))

I realized that fast-histogram treats the upper end of the range as exclusive, which is inconsistent with NumPy. Correct me if I am wrong.

I have tried many variations. result_2 is the closest one, but it has a length of 10.

I really want to replace NumPy's histogram with fast-histogram, but I need the same result.
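
The length mismatch described above can be reproduced with NumPy alone. The following is a minimal sketch, not part of the original report: it assumes only that `np.histogram` treats a sequence passed as `bins` as bin edges (so 10 edges give 9 bins, with the last bin closed on the right) and an integer as a bin count. The variable names are illustrative.

```python
import numpy as np

test_case = np.array([1, 1, 2, 2, 3, 3, 10, 10])

# A sequence of 10 edges (1..10) defines 9 bins; the last bin, [9, 10], is closed.
edges = np.arange(test_case.min(), test_case.max() + 1)
freq_edges, _ = np.histogram(test_case, bins=edges)

# An integer requests that many equal-width bins between the data min and max.
freq_count, bin_edges = np.histogram(test_case, bins=10)

# One bin per integer value needs an extra edge at max + 1.
per_int_edges = np.arange(test_case.min(), test_case.max() + 2)
per_int, _ = np.histogram(test_case, bins=per_int_edges)

# np.bincount counts from 0, so slice off the entries below the minimum.
via_bincount = np.bincount(test_case)[test_case.min():]

print(len(freq_edges), len(freq_count), len(per_int), len(via_bincount))
```

Read this way, the 9-bin NumPy result merges the values 9 and 10 into one closed bin, while a 10-bin layout over the range from 1 to 11 (as in result_2) gives one bin per integer and agrees with the sliced `np.bincount` output.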

astrofrog commented 1 year ago

Yes, this is still maintained. I will respond soon!

astrofrog commented 1 year ago

@d5423197 if you are trying to bin integers, I highly recommend using np.bincount. What you are seeing here is a subtle difference between NumPy and fast-histogram: if a value is exactly equal to the upper bound of the range, it is not included by fast-histogram (this is for performance). If you prefer not to use np.bincount (which should be the fastest option if you really are binning integers), another option is to add a tiny value to the upper end of the range when calling fast-histogram, e.g., instead of binning from 0 to 10 you would bin from 0 to 10 + 1e-30 or similar. Does this make sense?
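
To make the excluded upper bound concrete, here is a small pure-Python sketch. `histogram_half_open` is a hypothetical stand-in written for this illustration, not fast-histogram's actual implementation; it only assumes the behavior described above, namely equal-width bins over a range whose upper end is excluded. It also shows two caveats about the size of the nudge, assuming the range is handled in double precision.

```python
import math


def histogram_half_open(values, bins, lo, hi):
    """Hypothetical model: equal-width bins over [lo, hi); values outside are dropped."""
    counts = [0] * bins
    scale = bins / (hi - lo)
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) * scale)] += 1
    return counts


floats = [0.3, 4.2, 7.7, 10.0]

# With the upper bound excluded, the value equal to it is dropped.
dropped = histogram_half_open(floats, 5, 0.0, 10.0)

# 1e-30 is far below the spacing of doubles near 10, so the sum is still 10.0.
unchanged = (10.0 + 1e-30 == 10.0)

# The next representable double above 10.0 is the smallest nudge that has an effect.
hi = math.nextafter(10.0, math.inf)  # requires Python 3.9+
kept = histogram_half_open(floats, 5, 0.0, hi)

# Caveat: the nudge also shifts the interior bin edges slightly, so integers
# that sit exactly on an edge can slip into the bin below.
ints = [1, 1, 2, 2, 3, 3, 10, 10]
shifted = histogram_half_open(ints, 9, 1.0, hi)

print(dropped, unchanged, kept, shifted)
```

In this model the nudge recovers the maximum for the float data, but for integer data lying on bin edges the counts no longer match NumPy's, which is one more reason np.bincount (or an integer-aligned range such as min to max + 1) is the safer choice for integers. With fast-histogram itself, the nudged call would presumably look like `histogram1d(x, bins=5, range=(0, math.nextafter(10.0, math.inf)))`.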