Closed YuhsiHu closed 2 years ago
Hello,
I just tried it on Ubuntu 20.04.3 LTS, and got this in the mesh folder:
That particular scene varies a bit between runs, but we haven't seen a collapse like the one you report above.
Did you run the nerd_ehead.json config unmodified or with reduced batch size? Are the other configs working on your machine?
I also haven't seen it, so it could just be a divergence early in training. You could check what the `img_mesh_pass*.png` files look like. If the problem persists, you can try adding the line `"lock_pos": true` to the `.json` config file as a workaround. This will disable position training/tuning during the second (mesh) pass, and should ensure you get the same geometry from both passes.
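For clarity, the workaround is just one extra key in the scene config. A minimal sketch (the `learning_rate` line is quoted from this thread; any other keys in the real `nerd_ehead.json` stay as they are):

```json
{
    "learning_rate": [0.03, 0.03],
    "lock_pos": true
}
```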
Here is the img_mesh_pass_000000.jpg:
And the img_mesh_pass_000001.jpg looks like:
I will try adding this line and update on the progress. Thank you!
Adding `"lock_pos": true` to the JSON config file did avoid that. However, the PSNR became lower:
```
MSE, PSNR
0.00195455, 27.823
Base mesh has 7708 triangles and 3862 vertices.
MSE, PSNR
0.00389249, 24.451
Writing mesh: out/nerd_ehead/mesh/mesh.obj
    writing 3862 vertices
    writing 6643 texcoords
    writing 3862 normals
    writing 7708 faces
Writing material: out/nerd_ehead/mesh/mesh.mtl
Done exporting mesh
```
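For reference, the PSNR drop in the log tracks the MSE increase directly. A quick sketch, assuming the standard definition PSNR = -10·log10(MSE) for images in [0, 1] (the logged PSNR values differ slightly from this formula, presumably because the trainer averages per frame before printing):

```python
import math

def psnr_from_mse(mse):
    # Standard PSNR for images normalized to [0, 1]: -10 * log10(MSE).
    return -10.0 * math.log10(mse)

# MSE values from the log above: first (volumetric) pass vs. second (mesh) pass.
psnr_pass1 = psnr_from_mse(0.00195455)  # roughly 27.1 dB
psnr_pass2 = psnr_from_mse(0.00389249)  # roughly 24.1 dB
drop = psnr_pass1 - psnr_pass2          # roughly a 3 dB drop
```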
Other results are good (e.g., chair). Why did this happen?
Just to confirm, you are indeed running with batch size 8 in the config? Smaller batch size means more noise in the gradients, and can be harder to train.
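The batch-size effect can be illustrated with a toy sketch (not nvdiffrec code): the spread of a mini-batch gradient estimate shrinks roughly as 1/sqrt(batch size), so halving or quartering the batch makes each update noticeably noisier.

```python
import random
import statistics

random.seed(0)

def batch_grad_noise(batch_size, trials=2000):
    # Toy model: each per-sample "gradient" is 1.0 plus unit Gaussian noise;
    # the batch gradient is the mean over the batch. The standard deviation
    # of that mean across trials shrinks roughly as 1 / sqrt(batch_size).
    means = [
        statistics.fmean(1.0 + random.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

noise_batch_2 = batch_grad_noise(2)  # noisier gradient estimate
noise_batch_8 = batch_grad_noise(8)  # about half the noise (sqrt(8/2) = 2)
```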
Another thing you could try, instead of `"lock_pos": true`, is to lower the learning rate in the second pass: replace the line `"learning_rate": [0.03, 0.03],` with, say, `"learning_rate": [0.03, 0.01],` or even `"learning_rate": [0.03, 0.003],`.
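Putting the pieces of this thread together, the relevant part of the config would look something like the sketch below (the `learning_rate` values are quoted from this thread; the `batch` key name is an assumption based on the batch size 8 mentioned above, so check it against the actual `nerd_ehead.json`):

```json
{
    "batch": 8,
    "learning_rate": [0.03, 0.01]
}
```

The two entries of `"learning_rate"` are the rates for the first (volumetric) and second (mesh) pass respectively, so only the second entry needs lowering.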
That said, the default config should produce better results. I checked a recent run with the default config unmodified, and these are the first two frames from the img_mesh_pass from that particular run.
I did not modify batch size or any other parameter. The first two frames from img_mesh_pass are as follows:
This looks much better! Are you saying that one particular training diverged and that it is now working on your end? We are seeing some variation from run to run (the SDF is randomly initialized etc.) but on the two runs we started this morning on two different machines (using the released code unmodified), both looked as expected. Here is img_mesh_pass_000050.png
I've also re-run the `nerd_ehead.json` config and it seems to behave OK on my end (first two img_mesh_pass).
Please try a fresh `git clone`, re-run the unmodified config, and see if the problem persists. Since training is stochastic, there's a small chance that it's just some random fluke.
Thank you for your reply. I tried `git clone` again but there was no change on my machine: the result is the same when I do not add `"lock_pos": true` to the JSON file. Maybe I should set the learning rate lower?
Yes, as discussed above, you can try reducing the learning rate, but the default config should generate better results than what you report. We tried reproducing your issue on two different machines, using the unmodified code and config from the repo, and we got much better results, as posted above. Some variation between runs is expected, but I would expect results in line with what we posted.
Thank you for the great work! I followed the one-time setup on Ubuntu 18.04 with Python 3.9, CUDA 11.3, and PyTorch 1.10. Then I tried nerd_ehead by running:

```
python train.py --config configs/nerd_ehead.json
```

In the dmtet_mesh folder, the mesh looks good:
But in the mesh folder, the mesh.obj looks confusing:
I did not change any code in this repo, so I do not know why this happened. Could you please help me solve this problem? Thank you for your time!