bayesiains / nflows

Normalizing flows in PyTorch
MIT License

Use torch.searchsorted instead of our ad-hoc implementation #19

Open jmarshrossney opened 4 years ago

jmarshrossney commented 4 years ago

Hi!

I compared the searchsorted function implemented here, which computes torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1, with the C++ implementation at https://github.com/aliutkus/torchsearchsorted, and the nflows version appears to be a lot slower, at least on CPU.
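For concreteness, here is the ad-hoc helper reconstructed as a standalone function from the expression above (the function name is mine):

```python
import torch

def searchsorted_adhoc(bin_locations, inputs):
    # Compare each input against every bin edge via broadcasting, then
    # count how many edges the input is >= to; subtracting 1 turns that
    # count into a zero-based bin index.
    return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1
```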

I modified benchmark.py in torchsearchsorted, copy-pasting the function from nflows for comparison. The output (all on CPU) was:

```
Benchmark searchsorted:
- a [5000 x 16]
- v [5000 x 1]
- reporting fastest time of 10 runs
- each run executes searchsorted 100 times

Numpy:              0.9516626670001642
torchsearchsorted:  0.009861100999842165
nflows:             50.19729063499926
```

i.e. searching 5000 inputs, each against its own set of 16 bins.
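For anyone wanting to reproduce this, a rough timing harness along the same lines (shapes taken from the output above; torch.searchsorted requires PyTorch >= 1.6):

```python
import timeit
import torch

# 5000 rows of 16 sorted bin edges, one query value per row, as in the
# benchmark output above.
bins = torch.sort(torch.randn(5000, 16), dim=-1).values
queries = torch.randn(5000)

def adhoc():
    # nflows-style bucketization via broadcasting and summing.
    return torch.sum(queries[..., None] >= bins, dim=-1) - 1

def builtin():
    # torch.searchsorted needs the query's batch dims to match the sorted
    # sequence's, hence the unsqueeze/squeeze.
    return torch.searchsorted(bins, queries[..., None]).squeeze(-1)

print("ad-hoc :", min(timeit.repeat(adhoc, number=100, repeat=10)))
print("builtin:", min(timeit.repeat(builtin, number=100, repeat=10)))
```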

Am I missing something here? If not, it looks like the spline flows could be sped up quite a bit by using torchsearchsorted or something similar?

Cheers.

arturbekasov commented 4 years ago

Hi Joe,

Indeed, our searchsorted implementation is slower than the CUDA implementation that you reference. We've done similar comparisons ourselves at some point. We've decided against using the CUDA implementation for a few reasons:

  1. Custom CUDA kernels can be a headache to use. They need to be compiled at runtime, which means the machine you're running on must have the right compiler and development headers.
  2. Given that both TensorFlow and NumPy have it, we were quite confident that searchsorted would eventually be merged into PyTorch. And indeed it has since been merged: see the related issue and docs.
  3. You're right that the difference in performance can be drastic when running searchsorted in isolation. However, bucketization is only one of many operations performed when running a spline flow, so in end-to-end benchmarks we've observed only a ~30% speed-up from the custom CUDA kernel. A noticeable improvement, but not an orders-of-magnitude one.

Hope this makes sense. In terms of next steps, now that searchsorted is in PyTorch, a ~30% speed-up makes it well worth replacing our ad-hoc implementation with torch.searchsorted. My only concern is that we'd then depend on a very recent version of PyTorch, in fact the latest stable release (1.6). I don't know how big of a deal that would be.
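Sketching what the replacement might look like (not committed code; it assumes we keep the count - 1 indexing convention of the current helper):

```python
import torch

def searchsorted_builtin(bin_locations, inputs):
    # Sketch of a torch.searchsorted-based replacement (PyTorch >= 1.6).
    # The ad-hoc version returns the number of edges each input is >= to,
    # minus 1. With right=True, torch.searchsorted returns exactly that
    # count (the insertion point to the right of equal values), so
    # subtracting 1 reproduces the same bin indices.
    return torch.searchsorted(
        bin_locations, inputs[..., None], right=True
    ).squeeze(-1) - 1
```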

Thanks,

Artur

arturbekasov commented 4 years ago

For future reference: #9 has been merged, which also uses a feature that is only available in PyTorch 1.6 (non-persistent buffers).
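For anyone unfamiliar with non-persistent buffers, a minimal illustration (the module and buffer names here are made up):

```python
import torch
from torch import nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # persistent=False (added in PyTorch 1.6): the buffer still moves
        # with the module across devices/dtypes, but is excluded from the
        # module's state_dict.
        self.register_buffer("scratch", torch.zeros(8), persistent=False)
```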