Open jmarshrossney opened 4 years ago
Hi Joe,
Indeed, our `searchsorted` implementation is slower than the CUDA implementation that you reference. We've done similar comparisons ourselves at some point. We've decided against using the CUDA implementation for a few reasons:

- We expected that `searchsorted` would be merged into PyTorch eventually. And indeed it has since been merged: see the related issue and docs.
- Your comparison benchmarks `searchsorted` in isolation. However, bucketization is only one of the many operations performed when running a spline flow. As a result, in end-to-end benchmarks we've observed a ~30% speed-up when using the custom CUDA kernel: a noticeable improvement, but not an orders-of-magnitude one.

Hope this makes sense. In terms of next steps, now that `searchsorted` is in PyTorch, a 30% speed-up is well worth it: we should replace our ad-hoc implementation with `torch.searchsorted`. My only concern is that we'd then depend on a very recent version of PyTorch, in fact the latest stable one (1.6). I don't know how big of a deal this would be.
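For illustration, a minimal sketch of what that replacement could look like. The function names here are hypothetical, not the ones used in nflows; the key point is that `torch.searchsorted` with `right=True` counts the bin edges less than or equal to each input, which matches the broadcast-and-sum trick exactly:

```python
import torch

def searchsorted_sum(bin_locations, inputs):
    # The ad-hoc implementation: for each input, count how many
    # bin edges it is >= to; that count minus one is its bin index.
    return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1

def searchsorted_native(bin_locations, inputs):
    # Possible drop-in replacement using torch.searchsorted (PyTorch >= 1.6).
    # right=True counts edges <= input, matching the sum above; the
    # trailing dim is added then squeezed to satisfy searchsorted's
    # batched-shape requirements (sorted_sequence: (*, N), input: (*, M)).
    return torch.searchsorted(
        bin_locations, inputs[..., None], right=True
    ).squeeze(-1) - 1
```

Both take `bin_locations` of shape `(batch, num_bins)`, sorted along the last dimension, and `inputs` of shape `(batch,)`.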
Thanks,
Artur
For future reference: #9 has been merged, which also uses a feature that is only available in PyTorch 1.6 (non-persistent buffers).
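For anyone unfamiliar with the feature mentioned above, here is a toy sketch (the module name is illustrative, not from nflows) of the difference between persistent and non-persistent buffers:

```python
import torch
from torch import nn

class ToyFlowStep(nn.Module):
    # Hypothetical module, only for demonstrating the persistent flag.
    def __init__(self):
        super().__init__()
        # Ordinary buffer: saved in (and loaded from) the state dict.
        self.register_buffer("running_mean", torch.zeros(4))
        # Non-persistent buffer (PyTorch >= 1.6): moves with .to()/.cuda()
        # like any buffer, but is excluded from the state dict.
        self.register_buffer("scratch", torch.zeros(4), persistent=False)

module = ToyFlowStep()
keys = set(module.state_dict().keys())
```

Checkpoints produced with such a module therefore omit the non-persistent tensor, which is why loading them requires PyTorch 1.6 or later.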
Hi!
I compared the `searchsorted` function implemented here, which does `torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1`, with the C++/CUDA implementation at https://github.com/aliutkus/torchsearchsorted, and it appears to be a lot slower, on CPU at least.

I modified `benchmark.py` in torchsearchsorted and copy-pasted the function from nflows for comparison. The output (all on CPU) was: [benchmark timings not preserved], i.e. sorting 5000 inputs into 5000 individual sets of 16 bins.
Am I missing something here? If not, it looks like the spline flows could be sped up quite a bit by using torchsearchsorted or something similar?
Cheers.