svd_cpu error while running cifar10_aug_flow.py #13

pkulwj1994 commented 3 years ago

Hi Didrik. I have read your SurVAE paper and am very interested in the topic.

Unfortunately, I hit an svd_cpu error while running cifar10_aug_flow.py.

The full traceback is below:

#########################

Traceback (most recent call last):
  File "", line 13, in loss.backward()
  File "/home/william/PRGRAMS/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/william/PRGRAMS/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: svd_cpu: the updating process of SBDSDC did not converge (error: 23)

#########################

Could you release your experiment configuration, such as the PyTorch version, CUDA version, and so on? I believe this information would help track down the problem. Thank you!

pkulwj1994 commented 3 years ago

I also hit the same problem while reproducing the CIFAR-10 max-pooling experiment by running:

python train.py --epochs 500 --batch_size 32 --optimizer adamax --lr 1e-3 --gamma 0.995 --eval_every 1 --check_every 10 --warmup 5000 --num_steps 12 --num_scales 2 --dequant flow --pooling max --dataset cifar10 --augmentation eta --name maxpool

The error looks like this:

#########################

Traceback (most recent call last): 15520/50000, Bits/dim: inf707093926360972611886776320.000
  File "train.py", line 65, in exp.run()
  File "/home/william/WORKSPACE/01_PYTHON_PROJECT/04_surVAE_flow/survae_flows/experiments/image/experiment/flow.py", line 132, in run
    super(FlowExperiment, self).run(epochs=self.args.epochs)
  File "/home/william/WORKSPACE/01_PYTHON_PROJECT/04_surVAE_flow/survae_flows/experiments/image/experiment/base.py", line 122, in run
    train_dict = self.train_fn(epoch)
  File "/home/william/WORKSPACE/01_PYTHON_PROJECT/04_surVAE_flow/survae_flows/experiments/image/experiment/flow.py", line 141, in train_fn
    loss.backward()
  File "/home/william/PRGRAMS/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/william/PRGRAMS/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: svd_cpu: the updating process of SBDSDC did not converge (error: 11)

#########################
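The progress output in the traceback above reports a Bits/dim value that begins with inf at the moment the error fires. One possible reading, not confirmed anywhere in this thread, is that the loss had already diverged before the SVD call in the backward pass. A minimal sketch of a guard that stops as soon as the scalar loss is non-finite is shown below; the function name `checked_loss` and the `step` argument are hypothetical and not part of survae_flows.

```python
import math


def checked_loss(loss_value: float, step: int) -> float:
    """Return loss_value unchanged, or raise if it is inf or NaN.

    Hypothetical helper for illustration: in a PyTorch loop one would pass
    loss.item() here before calling loss.backward().
    """
    if not math.isfinite(loss_value):
        raise FloatingPointError(
            f"non-finite loss {loss_value!r} at step {step}; "
            "stopping before backward()"
        )
    return loss_value


# Example: a finite loss passes through, a diverged one is reported early.
checked_loss(3.25, step=0)
try:
    checked_loss(float("inf"), step=1)
except FloatingPointError as err:
    print(err)
```

Failing early like this would only help narrow down whether the divergence precedes the svd_cpu error; it does not address the cause.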

My experiment environment is:

Ubuntu 20.04 LTS, a single GTX 1080 Ti GPU, CUDA 11.0, torch 1.7.1, torchvision 0.8.2
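For comparing machines, the same kind of environment summary can be collected with a short script. This is a sketch using only the Python standard library (Python 3.8 or newer); the function name `collect_versions` is hypothetical, and packages that are not installed are reported as None rather than raising.

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "torchvision")):
    """Map 'python', 'os', and each package name to a version string.

    A package that is not installed maps to None.
    """
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

When torch is installed, `torch.version.cuda` additionally reports the CUDA version the build was compiled against, which may differ from the system-wide CUDA toolkit.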

By the way, I believe it would be great if you could release a Colab notebook version of the experiment, Didrik!

Many thanks!

didriknielsen commented 3 years ago

Hi! Thanks for your interest!

The Conv1x1 was updated to run on CPU (https://github.com/didriknielsen/survae_flows/commit/48613e9cf0d6d2426c26d99435a117fc00f7fbf0) due to low GPU utilization (https://github.com/didriknielsen/survae_flows/issues/3).

However, this was done after the original runs. I would therefore try passing slogdet_cpu=False to all Conv1x1 layers. Hope that helps!
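The most direct way to apply this suggestion is to pass slogdet_cpu=False wherever a Conv1x1 layer is constructed. For a model that is already built, a helper along the following lines might work. It is a sketch under two assumptions that are not verified here: that each Conv1x1 layer stores the flag as an attribute named `slogdet_cpu`, and that the model exposes a `modules()` iterator as `torch.nn.Module` does. The `Fake*` classes are hypothetical stand-ins so the example runs without PyTorch or survae installed.

```python
def set_slogdet_cpu(model, value=False):
    """Set `slogdet_cpu` on every submodule that has that attribute.

    Returns the number of submodules that were updated.
    """
    updated = 0
    for module in model.modules():
        if hasattr(module, "slogdet_cpu"):
            module.slogdet_cpu = value
            updated += 1
    return updated


# Hypothetical stand-ins that mimic the modules() traversal of nn.Module.
class FakeConv1x1:
    def __init__(self, slogdet_cpu=True):
        self.slogdet_cpu = slogdet_cpu

    def modules(self):
        yield self


class FakeCoupling:
    def modules(self):
        yield self


class FakeFlow:
    def __init__(self, layers):
        self.layers = layers

    def modules(self):
        yield self
        for layer in self.layers:
            yield from layer.modules()


flow = FakeFlow([FakeConv1x1(), FakeCoupling(), FakeConv1x1()])
print(set_slogdet_cpu(flow, False))  # both FakeConv1x1 layers are switched
```

With a real model the call would be made once, after construction and before training starts.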

robert-giaquinto commented 3 years ago

I had this issue too. Oddly enough, the problem stopped when I replaced the DenseNet-based coupling layers with either simple convolutions or continuous mixture CDFs (from the Flow++ paper, https://arxiv.org/abs/1902.00275). Both are available in my fork of the repo if you want to see their implementation. It makes sense that it's ultimately related to the Conv1x1 computation, though.

pkulwj1994 commented 3 years ago

Sorry for the late response. I tried the experiments on another machine and everything ran correctly! I guess my lab computer's configuration is too old, or its CUDA/Torch versions are not suitable for the project.

didriknielsen commented 3 years ago

That's great to hear!
