bharat-b7 / MultiGarmentNetwork

Repo for "Multi-Garment Net: Learning to Dress 3D People from Images, ICCV'19"

test_network.py, fine_tune crashes #18

Closed andrewjong closed 4 years ago

andrewjong commented 4 years ago

Hi Bharat, thank you for your work.

I'm running test_network.py. The code crashes for me in the "Optimize the Network" section, which I believe implements the 2D test-time supervision described in the paper. Specifically, it crashes in the fine_tune() function, during the second m.train() loop.

 ...
Ep: 46, rendered :199.89004516601562, laplacian :nan, J_2d :675.6401977539062
Ep: 47, rendered :200.08740234375, laplacian :nan, J_2d :675.3319702148438
Ep: 48, rendered :200.31394958496094, laplacian :nan, J_2d :675.0046997070312
Ep: 49, rendered :200.67727661132812, laplacian :nan, J_2d :674.6705932617188
Ep: 0, rendered :20111.421875, laplacian :nan, J_2d :6.743411064147949
Ep: 1, rendered :20095.291015625, laplacian :nan, J_2d :nan
2020-03-25 16:22:59.837638: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at svd_op_gpu.cu.cc:139 : Internal: tensorflow/core/kernels/cuda_solvers.cc:628: cuSolverDN call failed with status =6
2020-03-25 16:22:59.837759: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at svd_op_gpu.cu.cc:181 : Invalid argument: Got info = 2 for batch index 0, expected info = 0. Debug_info = gesvd
Traceback (most recent call last):
  File "test_network.py", line 198, in <module>
    m = fine_tune(m, dat, dat, display=False)
  File "test_network.py", line 147, in fine_tune
    lo = m.train(inp, out, loss_dict=losses_2d, vars2opt=vars2opt)
  File "/home/andrew/Development/MultiGarmentNetwork/network/base_network.py", line 437, in train
    out_dict = self.call([images, vertex_label, J_2d])
  File "/home/andrew/Development/MultiGarmentNetwork/network/base_network.py", line 375, in call
    v, t, n, _ = self.smpl(p, betas, t, offsets_)
  File "/home/andrew/.miniconda3/envs/mgn/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 592, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/home/andrew/Development/MultiGarmentNetwork/smpl/batch_smpl.py", line 182, in call
    s, u, v = tf.svd(theta)
  File "/home/andrew/.miniconda3/envs/mgn/lib/python3.7/site-packages/tensorflow/python/ops/linalg_ops.py", line 418, in svd
    tensor, compute_uv=compute_uv, full_matrices=full_matrices, name=name)
  File "/home/andrew/.miniconda3/envs/mgn/lib/python3.7/site-packages/tensorflow/python/ops/gen_linalg_ops.py", line 2108, in svd
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: tensorflow/core/kernels/cuda_solvers.cc:628: cuSolverDN call failed with status =6 [Op:Svd]

I'm guessing this is caused by J_2d becoming nan (the laplacian loss is already nan from the first loop). Any idea what the problem could be, or any advice on how to debug it? Thanks!
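In case it helps anyone hitting the same thing, one way to localize the nan is sketched below. This is not repo code: the input scan assumes `dat` is a dict of NumPy arrays as loaded from test_data.pkl, and the check on `theta` is an extra line I would add by hand.

```python
import numpy as np
import tensorflow as tf

# (1) Scan the inputs before fine-tuning.
#     Assumes `dat` is a dict of NumPy arrays (assumption, not repo code).
def report_nans(dat):
    for key, val in dat.items():
        arr = np.asarray(val)
        if np.issubdtype(arr.dtype, np.floating) and np.isnan(arr).any():
            print('NaNs in input "{}"'.format(key))

# (2) Fail with a readable message instead of the cuSolver error.
#     In smpl/batch_smpl.py, just before `s, u, v = tf.svd(theta)`:
# theta = tf.check_numerics(theta, 'theta fed to SVD is not finite')
```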

neonb88 commented 4 years ago

Did you use the test_data.pkl provided? I wasn't encountering that problem, although I did have problems because my K80 GPU didn't have enough memory for the large images (i.e. high-resolution video) I was trying to use.
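If GPU memory is the limiting factor, one common workaround (a sketch of standard TF 1.x configuration, not something the repo does out of the box) is to let TensorFlow allocate GPU memory on demand instead of reserving the whole card up front:

```python
import tensorflow as tf

# Allocate GPU memory on demand rather than pre-allocating the whole card.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Graph mode:
# sess = tf.Session(config=config)
# Eager mode (TF 1.x):
# tf.enable_eager_execution(config=config)
```

This only avoids the up-front reservation; if the network itself needs more memory than the card has, smaller input images or a batch size of 1 are still required.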

andrewjong commented 4 years ago

Hi Nathan! Yup, this is test_data.pkl, split to just 1 batch to fit on my GPU (rough sketch below). That's interesting that you don't have that problem. ~Maybe the problem came from my attempt to bump the repo to Python 3.~ Edit: Never mind, I saw your other comment saying that you also used Python 3.
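For anyone needing the same trick, the one-batch split looks roughly like this. A sketch only: the file path, the latin-1 pickle encoding, and the assumption that test_data.pkl holds a dict of arrays batched along the first axis are all mine, not taken from the repo.

```python
import pickle as pkl
import numpy as np

# Path and pickle encoding are assumptions; adjust to your checkout.
with open('assets/test_data.pkl', 'rb') as f:
    dat = pkl.load(f, encoding='latin-1')

# Keep only the first sample of every batched array so it fits on the GPU.
dat = {k: (np.asarray(v)[:1] if np.ndim(v) > 0 else v) for k, v in dat.items()}
```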

neonb88 commented 4 years ago

Hi Andrew, I wish I had a moment to take a look. Please keep us updated; Bharat is pretty good about checking these issues. It seems he's more likely to answer if he can answer the question in about one line, though, haha.

bharat-b7 commented 4 years ago

I'm unable to reproduce this issue. From the post, it seems that the loss is blowing up. Can you try reducing the learning rate?
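A minimal sketch of that suggestion (the 1e-5 value and the optimizer construction are illustrative assumptions; where the learning rate actually lives depends on how the repo builds its optimizer inside train()/fine_tune()):

```python
import tensorflow as tf

# Rebuild the optimizer with a ~10x smaller learning rate and re-run
# the fine-tuning loop; the exact wiring into m.train() is repo-specific.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-5)
```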