Hi,

I've noticed that the methods in `rational_quadratic.py` can be easily refactored to make them run ~25% faster.
The main change in `unconstrained_rational_quadratic_spline` is to avoid `masked_select`, which can be quite inefficient with dense masks because it has to assemble all the "unmasked" elements into a new tensor. To insert masked values into a predefined zero tensor, it is generally cheaper to multiply the input tensor by the mask and add the result to the target tensor, as done in this PR.
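To make the trade-off concrete, here is a minimal sketch of the two styles side by side. The function names and the `tanh` transform are illustrative only, not code from this repo; the point is that the multiply-and-add variant avoids gathering masked elements into a fresh tensor.

```python
import torch

def fill_via_masked_select(inputs, mask, transform):
    # Original style: boolean indexing gathers the masked elements into a
    # new tensor, transforms them, then scatters the results back.
    outputs = torch.zeros_like(inputs)
    outputs[mask] = transform(inputs[mask])
    return outputs

def fill_via_multiply_add(inputs, mask, transform):
    # Refactored style: transform everything, zero the unmasked entries
    # with a multiply, and add into the predefined zero tensor.
    outputs = torch.zeros_like(inputs)
    outputs = outputs + mask.to(inputs.dtype) * transform(inputs)
    return outputs

x = torch.randn(1000)
m = x > 0
a = fill_via_masked_select(x, m, torch.tanh)
b = fill_via_multiply_add(x, m, torch.tanh)
assert torch.allclose(a, b)
```

Both variants produce the same result; the second trades a little extra compute on the unmasked elements for avoiding the gather/scatter, which tends to win when the mask is dense.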
I've also made a couple of changes in `rational_quadratic_spline` to how the `widths`, `heights`, `cumwidths`, and `cumheights` tensors are computed. The refactored implementation removes some redundant operations from the original.
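For context, computing the bin edges usually amounts to a softmax, a single cumulative sum, and a zero pad. The sketch below shows that pattern for the width tensors; the function name, default constants, and exact normalization are my own illustration, not the repo's code.

```python
import torch
import torch.nn.functional as F

def bin_widths_and_cumwidths(unnormalized_widths, left=-1.0, right=1.0,
                             min_bin_width=1e-3):
    """Illustrative sketch: normalized bin widths and their left edges."""
    num_bins = unnormalized_widths.shape[-1]
    widths = F.softmax(unnormalized_widths, dim=-1)
    # Enforce a minimum bin width while keeping the widths summing to 1.
    widths = min_bin_width + (1 - min_bin_width * num_bins) * widths
    # One cumsum plus a zero pad yields all bin edges in a single pass,
    # avoiding recomputation over the widths tensor.
    cumwidths = torch.cumsum(widths, dim=-1)
    cumwidths = F.pad(cumwidths, pad=(1, 0), mode="constant", value=0.0)
    cumwidths = (right - left) * cumwidths + left
    cumwidths[..., -1] = right  # guard against accumulated rounding error
    return widths, cumwidths

u = torch.randn(4, 8)
w, cw = bin_widths_and_cumwidths(u)
```

The same pattern applies symmetrically to `heights` and `cumheights`.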
The rational-quadratic spline flow as used in the NSF paper runs about 25% faster with these changes. I think further improvements are possible if `searchsorted` is replaced with `torch.searchsorted` when run with the custom CUDA kernel described in #19, but I haven't touched that since it would affect the other spline flows too.
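For reference, `torch.searchsorted` locates each input's bin against a sorted edge tensor in one native call; the edge values below are made up for illustration.

```python
import torch

# Sorted bin edges and some inputs falling inside them (illustrative values).
bin_edges = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
inputs = torch.tensor([0.1, 0.6, 0.95])

# right=True returns the index of the first edge strictly greater than the
# input, so subtracting 1 gives the index of the bin containing it.
bin_idx = torch.searchsorted(bin_edges, inputs, right=True) - 1
# bin_idx -> tensor([0, 2, 3])
```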
I suppose the other spline flow methods can be refactored in a similar way; if you'd prefer, I can make those changes in this PR too.
Best, Vaidotas