didriknielsen / survae_flows

Code for paper "SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows"

[performance] slogdet is slow on GPU #5

Open · Nintorac opened this issue 3 years ago

Nintorac commented 3 years ago

Hey, great codebase, thank you!

I was looking into performance bottlenecks and found the following change, which gave me almost a 2x speedup (1.76 it/s -> 2.96 it/s) on the CIFAR-10 example.

The issue is in the Conv1x1 module: the torch.slogdet call is much slower on GPU than on CPU.

https://github.com/didriknielsen/survae_flows/blob/master/survae/transforms/bijections/conv1x1.py#L40

This is the modified, faster _logdet:

    def _logdet(self, x_shape):
        b, c, h, w = x_shape
        # Compute slogdet on CPU, where it is much faster than on GPU for these small matrices
        _, ldj_per_pixel = torch.slogdet(self.weight.to('cpu'))
        # The 1x1 convolution is applied at every spatial position, so scale by h * w
        ldj = ldj_per_pixel * h * w
        # Broadcast to the batch dimension and move back to the weight's device
        return ldj.expand([b]).to(self.weight.device)
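
As a rough way to sanity-check this on a given machine, one can time torch.slogdet directly on a matrix the size of a Conv1x1 weight. This is a minimal sketch, not from the original report: it assumes a CUDA device is available, the channel count of 96 is only an illustrative choice, and exact numbers will vary by hardware and PyTorch version.

    import time
    import torch

    W = torch.randn(96, 96)  # stand-in for a 96-channel Conv1x1 weight

    def time_slogdet(weight, n_iter=100):
        # Warm up once, then time repeated slogdet calls on the weight's device
        torch.slogdet(weight)
        if weight.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iter):
            torch.slogdet(weight)
        if weight.is_cuda:
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / n_iter

    print('CPU:', time_slogdet(W))
    if torch.cuda.is_available():
        print('GPU:', time_slogdet(W.to('cuda')))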
hmdolatabadi commented 3 years ago

Hi,

Related to this issue, if you try a large network (e.g. the Glow architecture for CIFAR-10), then you may encounter an error in the middle of training which says:

File "./examples/cifar10_aug_flow.py", line 102, in <module>
    loss.backward()
  File "/home/user/.conda/envs/idf/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/.conda/envs/idf/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 23)

After looking it up on Google, it seems that the SVD used internally by torch.slogdet may be responsible for this. A note in the official PyTorch documentation says:

Backward through slogdet() internally uses SVD results when input is not invertible. In this case, double backward through slogdet() will be unstable when input doesn't have distinct singular values. See svd() for details.

I haven't tested the above solution to see whether it has an effect or not.

UPDATE: After trying the above solution, the same problem happened to me at epoch 40 while training a model:

File "./examples/cifar10_aug_flow.py", line 102, in <module>
    loss.backward()
  File "/home/user/.conda/envs/idf/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/.conda/envs/idf/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: svd_cpu: the updating process of SBDSDC did not converge (error: 23)
didriknielsen commented 3 years ago

The issue is in the Conv1x1 module: the torch.slogdet call is much slower on GPU than on CPU.

Hi,

Thanks! This is gold. I tried it on my machine and also found a ~20% speedup from running torch.slogdet on CPU. I've added a slogdet_cpu argument to Conv1x1 with default=True.
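
For reference, the flag presumably just switches between the CPU and GPU code paths inside _logdet. A minimal sketch of what that could look like, assuming a boolean slogdet_cpu attribute on the module (the actual implementation in the repository may differ):

    def _logdet(self, x_shape):
        b, c, h, w = x_shape
        if self.slogdet_cpu:
            # CPU path: faster in practice for these small channel-sized matrices
            _, ldj_per_pixel = torch.slogdet(self.weight.to('cpu'))
        else:
            _, ldj_per_pixel = torch.slogdet(self.weight)
        ldj = ldj_per_pixel * h * w
        return ldj.expand([b]).to(self.weight.device)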

didriknielsen commented 3 years ago

Related to this issue, if you try a large network (e.g. the Glow architecture for CIFAR-10), then you may encounter an error in the middle of training which says: [...]

The CIFAR-10 example uses the default scale_fn=lambda s: torch.exp(s) in the AffineCouplingBijection. This choice can lead to instability during longer training since the scales output by the coupling networks can become very large.

I would suggest using something like the alternatives sketched at the end of this comment instead, which keep the scales bounded.

The first choice is what we used in our image experiments, the second is what was used in the Glow code.
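
As a sketch of the kind of bounded scale functions meant here: the sigmoid(s + 2.) form below is the one used in the official Glow code, while the tanh-limited exponential is only an assumed stand-in for the first choice, not a confirmed quote of it; the usage line and the coupling_net name are likewise hypothetical.

    import torch

    # Bounded alternatives to the default scale_fn = lambda s: torch.exp(s)

    # Assumed stand-in for the first choice: a tanh-limited exponential,
    # which keeps the scale within (exp(-2), exp(2))
    scale_fn_tanh_exp = lambda s: torch.exp(2. * torch.tanh(s / 2.))

    # Scale function used in the official Glow code, bounded in (0, 1)
    scale_fn_glow = lambda s: torch.sigmoid(s + 2.)

    # Hypothetical usage with the coupling layer mentioned above:
    # coupling = AffineCouplingBijection(coupling_net, scale_fn=scale_fn_glow)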