Collaboration - Githubissues

ml31415 commented 6 years ago

Hi guys! Friendly greetings from over there at numpy-groupies. Looks like we're doing quite similar stuff there. What do you guys think of joining forces and making one package out of it, instead of two? numpy-groupies already had such a join a while ago, and collaboration works quite well so far. You did a good job in documenting the package and also have a bunch of features we didn't implement yet. At the same time, we'd have to offer drastic speedups for many functions. np.add.at and other ufuncs are sadly slow, and also pre-sorting arrays is a tedious task, that can be omitted in many cases. Pls check out our project, and let me know what you think!

https://github.com/ml31415/numpy-groupies

Korijn commented 6 years ago

Love the suggestion!

Speaking for myself though, I don't have time to meaningfully contribute to this project.

EelcoHoogendoorn commented 6 years ago

I like the spirit of it as well, but I started this project with some design goals in mind that I think are at odds with it.

I started this as an extension of the numpy implementation of np.unique; with the intention of getting such changes merged into numpy eventually. I did not want to be tied to the very conservative release schedule and backwards compatibility requirements of numpy though, but keep some room for experimental features, so I made this separate package. Some of the functionality implemented here has since made it into numpy; and most of the rest of it ought to make it there in due time, in my estimation. Thats still the long term goal.

That is the reason I am also very conservative with external dependencies in this project. Yes, some things could be faster with numba or somesuch; but I want to keep it to pure numpy here, and focus on generality and clean interfaces, not on getting the last bit of speed out. I think its a fair rule of thumb that if it has optimal complexity and gets within a factor 4 of a handcrafted C implementation, its good enough for a numpy implementation; and thats the philosophy I am sticking to here.

Taking these different possible goals / emphasis in mind, I think it makes sense to have separate packages.

ml31415 commented 6 years ago

Our initial grief were the slow and incomplete ufuncs and we'd love to see a faster version integrated into numpy as well. I agree, our main focus was getting a reasonably fast version right away, only aiming for a numpy integration in the very long run. We had started an issue on the numpy bugtracker once discussing the slow ufuncs, though it didn't look like there would be an easy fix for it.

In terms of speed, what we see by benchmarking our numba and weave code, it runs about 30x faster than numpy ufuncs. Using np.bincount or plain indexing, things get better for numpy, and as you said, the hand crafted code is only 2-4x faster for some functions. Still, looking at things like grouped cumsum, required to do some sorting with plain numpy, you see a 100x speedup.

About external dependencies, I agree, the less the better. I don't want to force anyone to use numba, just for doing some indexed operations. At the same time there are applications (having such one on my own), where you just want to run things as fast as possible and don't mind installing another dependency, or you use those dependencies anyways already. Our approach to that was having different implementations for different tastes / requirements, sharing the same interface. I guess that might also be a reasonable approach for a possible combined project.

So, I agree, goals and emphasis are a bit different, but imho still not that far off, that a combined package would seem unrealistic. In terms of generality and clean interface, full acc. It also wouldn't need to happen right away. I guess adding a wrapper around numpy-groupies, providing an numpy-indexed-like interface might make sense, and mb also the other way round. Mb providing a common interface at some point and agreeing on how an interface for implementations with different dependencies might look like. Anyways, up to you. Mb you want to give it another thought!

Kind regards,

Michael

EelcoHoogendoorn commented 6 years ago

You are correct; in some cases the numpy solution leaves a lot of potential gains on the table; however, for my personal use cases it has always been 'good enough'. btw; I see your numpy based groupby-sum is based on numpy.add.at; but i found that for a typical use case, explicitly sorting and using numpy.add.reduceat is a lot faster, without losing any generality.

I suppose we could hide different 'backends' behind a single interface, and then people could opt-in to a faster version by setting an alternative backend, if they have the required optional dependencies installed?

I dont have a clear intuition if that would fit well together, or if the different backends would impose subtly different interface requirements... for instance, the indirect sort is something that is often reusable; if you want to compute multiple independent reductions or operations on some indexed set; that is the main motivation for numpy-indexed OOP design. But would such basic considerations even hold up if you were not using indirect sorting?

ml31415 commented 6 years ago

Yes, I agree. Which backend (we referred to them as 'implementation' in npg, anyways) is used should be transparent for the user, except explicitly asked for one. We had implemented that by a bunch of simple import attempts for selecting the fastest possible solution automatically. There might be something more clever, but no one complained so far.

Groupby-sum? We had used np.bincount with weights for that, which actually gives rather competitive speeds, about 15x faster than the add.at solution. The numpy-ufunc backend is only there for reference. We included it, so that we can benchmark against it and pinpoint the amount of possible improvements there.

About subtle differences, yes, that's surely an issue, but something that can be handled by rigorous parametrized testing, running a thorough test suite against any implemented backend and comparing the results against each other. The major difference for us so far is the feature set differing from backend to backend. E.g. I didn't implement any sorting functions in numba or with weave. Also, as I consider the weave part as more or less deprecated, I didn't bother implementing new features for that one anymore. A solution would be, to always fall back to the numpy only version.

About the subtly different interface requirements, I'd suggest to provide optional keyword arguments, which were simply ignored by other backends. About reusing the argsort index for different operations: indeed, there is no big need to reuse anything, if you just do simple 1-pass operations. We had a similar discussion about doing smth like std and var in one go. Though as discussed there, there is probably not too much to gain from that.

EelcoHoogendoorn / Numpy_arraysetops_EP

Collaboration #7