NVlabs / nvdiffrec

Official code for the CVPR 2022 (oral) paper "Extracting Triangular 3D Models, Materials, and Lighting From Images".

Confusing mesh result on nerd_ehead #11

Closed · YuhsiHu closed 2 years ago

YuhsiHu commented 2 years ago

Thank you for the great work! I followed the one-time setup on Ubuntu 18.04 with Python 3.9, CUDA 11.3, and PyTorch 1.10, then ran a trial on nerd_ehead with:

`python train.py --config configs/nerd_ehead.json`

In the dmtet_mesh folder, the mesh looks good: image

But in the mesh folder, mesh.obj looks confusing: image

I did not change any code in this repo, so I don't know why this happened. Could you please help me solve this problem? Thank you for your time!

jmunkberg commented 2 years ago

Hello,

I just tried it on Ubuntu 20.04.3 LTS and got this in the mesh folder: image

That particular scene varies a bit between runs, but we haven't seen a collapse like the one you report above.

Did you run the nerd_ehead.json config unmodified or with reduced batch size? Are the other configs working on your machine?

JHnvidia commented 2 years ago

I also haven't seen it, so it could just be a divergence early in training. You could check what the img_mesh_pass*.png files look like. If the problem persists, you can try adding the line "lock_pos": true to the .json config file as a workaround. This disables position training/tuning during the second (mesh) pass and should ensure you get the same geometry from both passes.
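For reference, a minimal sketch of where the workaround might sit in the config file; only keys mentioned in this thread are shown, so the surrounding values are assumptions rather than the actual contents of configs/nerd_ehead.json:

```json
{
    "batch": 8,
    "learning_rate": [0.03, 0.03],
    "lock_pos": true
}
```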

YuhsiHu commented 2 years ago

Here is img_mesh_pass_000000.jpg: img_mesh_pass_000000

And img_mesh_pass_000001.jpg looks like: img_mesh_pass_000001

I will try adding this line and report back on the progress. Thank you!

YuhsiHu commented 2 years ago

Add "lock_pos":true to the json file did avoid that. However, the PSNR became lower.

```
MSE,      PSNR
0.00195455, 27.823
Base mesh has 7708 triangles and 3862 vertices.
MSE,      PSNR
0.00389249, 24.451
Writing mesh:  out/nerd_ehead/mesh/mesh.obj
    writing 3862 vertices
    writing 6643 texcoords
    writing 3862 normals
    writing 7708 faces
Writing material:  out/nerd_ehead/mesh/mesh.mtl
Done exporting mesh
```

Other results are good (e.g., chair). Why did this happen?

jmunkberg commented 2 years ago

Just to confirm: are you indeed running with batch size 8 in the config? A smaller batch size means more noise in the gradients and can make training harder.

Another thing you could try, instead of "lock_pos": true, is to lower the learning rate in the second pass: replace the line "learning_rate": [0.03, 0.03] with, say, "learning_rate": [0.03, 0.01], or even "learning_rate": [0.03, 0.003], as sketched below.
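A minimal sketch of that change, assuming the rest of the config is left untouched (only the learning-rate key is shown):

```json
{
    "learning_rate": [0.03, 0.01]
}
```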

That said, the default config should produce better results. I checked a recent run with the unmodified default config, and these are the first two frames of the img_mesh_pass output from that run: img_mesh_pass_000000 img_mesh_pass_000001

YuhsiHu commented 2 years ago

I did not modify the batch size or any other parameter. The first two frames from img_mesh_pass are as follows: img_mesh_pass_000000 img_mesh_pass_000001

jmunkberg commented 2 years ago

This looks much better! Are you saying that one particular training run diverged and that it is now working on your end? We see some variation from run to run (the SDF is randomly initialized, etc.), but on the two runs we started this morning on two different machines (using the released code unmodified), both looked as expected. Here is img_mesh_pass_000050.png:

img_mesh_pass_000050

JHnvidia commented 2 years ago

I've also re-run the nerd_ehead.json config, and it seems to behave OK on my end (first two img_mesh_pass frames below).

img_mesh_pass_000000 img_mesh_pass_000001

Please try a fresh git clone, re-run the unmodified config, and see if the problem persists. Since training is stochastic, there's a small chance it's just a random fluke.

YuhsiHu commented 2 years ago

Thank you for your reply. I tried a fresh git clone, but there was no change on my machine: the result is the same unless I add "lock_pos": true to the .json file. Maybe I should set the learning rate lower?

jmunkberg commented 2 years ago

Yes, as discussed above, you can try reducing the learning rate, but the default config should generate better results than what you report. We tried reproducing your issue on two different machines, using the unmodified code and config from the repo, and we got much better results, as posted above. Some variation between runs is expected, but I would expect results in line with those posted above.