astrofrog / fast-histogram

:zap: Fast 1D and 2D histogram functions in Python :zap:
BSD 2-Clause "Simplified" License
266 stars 28 forks source link

Add N-dimensional histogramming #54

Closed atrettin closed 3 years ago

atrettin commented 3 years ago

Hi all,

I happen to have a (scientific) application where I need fast histograms, and I found this package very promising to be exactly what I need! If it would only support histograms in arbitrary dimensions, it would be perfect... and I think that I did in fact manage to pull it off. :)

This PR adds the function histogramdd to the package.

It is a straight generalization of the 1D and 2D code, only that the input sample is given as a tuple. The code discovers automatically how many arrays there are in the tuple and generates the output dimensions accordingly. If the dimension is D and the number of samples N, then the tuple must contain D arrays with N elements. Bins and ranges are passed as arrays as well with shapes (D,) and (D, 2), respectively to pass the number of bins as well as the ranges for each dimension.

The output of histogram2d(x, y, range, bins) should be the same as histogramdd((x, y), range, bins). The speed of the generalized function is a bit slower, so it is still beneficial to use the dedicated 1D and 2D functions where appropriate.

I'm also working on the weighted version but I thought I'd make a PR to get some feedback. This is still missing tests and documentation, and maybe someone with more C and numpy API experience could have a look if there is any possibility to make things a little bit faster still to match the performance of the 1D and 2D code.

astrofrog commented 3 years ago

Thank you for the PR!! I will review it shortly.

atrettin commented 3 years ago

Alright, this is looking much better now! I've added the weighted version, added more input checks and automatic conversions and also added histogramdd to the unit tests. It gives verifiably the same results as histogram1d and histogram2d.

atrettin commented 3 years ago

Alright, now we also have tests for up to 10D against numpy.histogramdd. I think that should be all we need. :)

atrettin commented 3 years ago

Hey @astrofrog, it would be great if you could have a look at this, I hope that everything makes sense and I'd be happy to implement changes if need be.

astrofrog commented 3 years ago

Sorry for the delay in reviewing! I am only working part time at the moment so I am a bit slower than usual but I should get to it this week.

atrettin commented 3 years ago

Done! :)

astrofrog commented 3 years ago

Thanks! I'll merge this and will try and do a release soon.

atrettin commented 3 years ago

Thank you for merging! Just now looking at this I found an inconsistency between this package and numpy: When a data point is exactly at the upper bound of the largest bin, numpy includes it into the largest bin count, while we don't do that here. Furthermore, I found that in my own code this causes it to write one index farther into the count array, potentially risking a segfault... 😬 Sorry I just literally now realize that. I'll write a unit test demonstrating it and bring a fix, please hold on with the release until then!