Hi,

I've noticed that the methods in `rational_quadratic.py` can be easily refactored to make them run ~25% faster.
The main change in `unconstrained_rational_quadratic_spline` is to avoid `masked_select`, which can be quite inefficient with dense masks because it has to assemble all the "unmasked" elements into a new tensor. To insert masked values into a predefined zero tensor, it is generally cheaper to multiply the input tensor by the mask and add the result to the target tensor, as done in this PR.
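To make the trade-off concrete, here is a minimal sketch of the two styles side by side. The function names and the `tanh` transform are illustrative only, not code from this repo; the point is that the multiply-and-add variant avoids gathering masked elements into a fresh tensor.

```python
import torch

def fill_via_masked_select(inputs, mask, transform):
    # Original style: boolean indexing gathers the masked elements into a
    # new tensor, transforms them, then scatters the results back.
    outputs = torch.zeros_like(inputs)
    outputs[mask] = transform(inputs[mask])
    return outputs

def fill_via_multiply_add(inputs, mask, transform):
    # Refactored style: transform everything, zero the unmasked entries
    # with a multiply, and add into the predefined zero tensor.
    outputs = torch.zeros_like(inputs)
    outputs = outputs + mask.to(inputs.dtype) * transform(inputs)
    return outputs

x = torch.randn(1000)
m = x > 0
a = fill_via_masked_select(x, m, torch.tanh)
b = fill_via_multiply_add(x, m, torch.tanh)
assert torch.allclose(a, b)
```

Both variants produce the same result; the second trades a little extra compute on the unmasked elements for avoiding the gather/scatter, which tends to win when the mask is dense.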
I've also made a couple of changes in `rational_quadratic_spline` to how the `widths`, `heights`, `cumwidths`, and `cumheights` tensors are computed. The refactored implementation removes some redundant operations from the original.
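For context, computing the bin edges usually amounts to a softmax, a single cumulative sum, and a zero pad. The sketch below shows that pattern for the width tensors; the function name, default constants, and exact normalization are my own illustration, not the repo's code.

```python
import torch
import torch.nn.functional as F

def bin_widths_and_cumwidths(unnormalized_widths, left=-1.0, right=1.0,
                             min_bin_width=1e-3):
    """Illustrative sketch: normalized bin widths and their left edges."""
    num_bins = unnormalized_widths.shape[-1]
    widths = F.softmax(unnormalized_widths, dim=-1)
    # Enforce a minimum bin width while keeping the widths summing to 1.
    widths = min_bin_width + (1 - min_bin_width * num_bins) * widths
    # One cumsum plus a zero pad yields all bin edges in a single pass,
    # avoiding recomputation over the widths tensor.
    cumwidths = torch.cumsum(widths, dim=-1)
    cumwidths = F.pad(cumwidths, pad=(1, 0), mode="constant", value=0.0)
    cumwidths = (right - left) * cumwidths + left
    cumwidths[..., -1] = right  # guard against accumulated rounding error
    return widths, cumwidths

u = torch.randn(4, 8)
w, cw = bin_widths_and_cumwidths(u)
```

The same pattern applies symmetrically to `heights` and `cumheights`.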
The rational-quadratic spline flow as used in the NSF paper runs about 25% faster with these changes. I think further improvements are possible if `searchsorted` is replaced with `torch.searchsorted` when run with the custom CUDA kernel described in #19, but I haven't touched that since it would affect the other spline flows too.
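For reference, `torch.searchsorted` locates each input's bin against a sorted edge tensor in one native call; the edge values below are made up for illustration.

```python
import torch

# Sorted bin edges and some inputs falling inside them (illustrative values).
bin_edges = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
inputs = torch.tensor([0.1, 0.6, 0.95])

# right=True returns the index of the first edge strictly greater than the
# input, so subtracting 1 gives the index of the bin containing it.
bin_idx = torch.searchsorted(bin_edges, inputs, right=True) - 1
# bin_idx -> tensor([0, 2, 3])
```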
I suppose the other spline flow methods can be refactored in a similar way; if you'd prefer, I can make those changes in this PR too.
Best, Vaidotas