elliottwu / unsup3d

(CVPR'20 Oral) Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild
MIT License

Trained a model on the synface dataset and the results are not as good as the paper's #15

Closed diamond0910 closed 3 years ago

diamond0910 commented 4 years ago

Hi! Thank you very much for your excellent work!

I used the provided script python run.py --config experiments/train_celeba.yml --gpu 0 --num_workers 4 to train a model for the synface dataset.

Then I used python run.py --config experiments/test_celeba.yml --gpu 0 --num_workers 4 to test the model.

Finally, I got 0.0092±0.002 SIDE and 17.77±1.92 MAD, which is not as good as the paper (0.793±0.140 SIDE and 16.51±1.56 MAD in Table 2).
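For context, here is a minimal sketch of how I understand the two metrics to be computed (based on the standard definitions of scale-invariant depth error and mean angle deviation; the function names and the assumption of dense depth/normal maps are mine, not code from this repo):

```python
import torch
import torch.nn.functional as F

def side(depth_pred, depth_gt, eps=1e-8):
    # Scale-invariant depth error (SIDE): standard deviation of the
    # per-pixel log-depth difference, so a global scale factor cancels out.
    delta = torch.log(depth_pred + eps) - torch.log(depth_gt + eps)
    return (delta.pow(2).mean() - delta.mean().pow(2)).clamp(min=0).sqrt()

def mad(normal_pred, normal_gt):
    # Mean angle deviation (MAD) in degrees between unit normal maps
    # of shape (H, W, 3).
    cos = F.cosine_similarity(normal_pred, normal_gt, dim=-1)
    return torch.rad2deg(torch.acos(cos.clamp(-1, 1))).mean()
```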

Could there be a problem with how I ran it?

Thank you!

elliottwu commented 4 years ago

Hi, I assume you were using experiments/train_synface.yml and experiments/test_synface.yml, but just to confirm. Also, can you confirm that you are using exactly the same settings, e.g., the same batch size, the same number of epochs, etc.? If so, it might also be related to the CUDA version (or the PyTorch version), as was reported in this thread. I have not tested this yet. It would also be helpful to share some visualizations of your results and the environment you are using. Thanks!
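For reporting the environment, a few lines like the following cover the details that usually matter (a minimal sketch, not a script from this repo):

```python
import torch

# Print the PyTorch / CUDA / cuDNN versions and the GPU in use,
# which is the information needed to reproduce the setup.
print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```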

diamond0910 commented 4 years ago

Thank you for your reply! I will show the phenomenon of my training for your reference.

Using the same code, I got 0.0079±0.0014 SIDE and 16.24±1.52 MAD with CUDA 9.0, and 0.0092±0.0020 SIDE and 17.77±1.92 MAD with CUDA 10.2.

This result surprised me. The performance of the same code differs so much under different CUDA versions. Is there any explanation for this?
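For what it's worth, part of the gap could also be ordinary run-to-run variance from non-deterministic cuDNN kernels rather than the CUDA version itself. A quick sanity check (a sketch using standard PyTorch flags, nothing from this repo) is to fix the seeds and force deterministic convolutions before training and compare two runs on the same machine:

```python
import random
import numpy as np
import torch

def seed_everything(seed=0):
    # Fix all RNGs and make cuDNN select deterministic algorithms,
    # so two runs on the same machine produce the same numbers.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```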

Thank you!

Heng14 commented 2 years ago

Hi, I think I ran into a similar problem. I use CUDA 11.4 and torch 1.9, and I did not change anything in experiments/test_synface.yml, but training does not seem to converge: MAD stays around 50, SIDE around 0.2, and sometimes they even go to NaN. I have tried many times and it always fails to converge. I am wondering if you have any hints on this problem. Thank you!

diamond0910 commented 2 years ago

I solved this problem by downgrading my CUDA version to 10.2. I guess there may be a precision problem in some function.
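If it really is precision, one concrete suspect on CUDA 11 with recent PyTorch on Ampere GPUs is TF32, which is enabled by default there and lowers the precision of matmuls and convolutions. Disabling it is a cheap thing to try before training (just my guess, not a confirmed fix for this repo):

```python
import torch

# TF32 trades precision for speed on Ampere GPUs and is on by default
# in the PyTorch versions mentioned above; force full FP32 maths instead.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```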

Best.

Heng14 commented 2 years ago

Thank you! I also switched to CUDA 10.2 and the training process now converges. I hope the problem can be solved on CUDA 11 in the future.

diamond0910 commented 2 years ago

Oh, I remember it did not work on CUDA 11, so I used CUDA 10.2.