Open leonwu0108 opened 1 year ago
Hi there!
Thanks a lot for raising this issue! 90 minutes is definitely way too long and indicates a problem somewhere.
There are a couple of things that may have contributed to the long training time:

- Check that `has_apex` is `True`. Apex significantly speeds up training and should be set up automatically with the provided docker, but it's best to double-check.
- Run with `--no_viewer`. The 3D viewer hooks onto the training loop and renders at every frame, which can also slow things down.
- The `with_tensorboard: true` flag in `train_permuto_sdf.cfg` can also slow things down.
- The `--with_mask` flag.
- The `async_create_samples_cleaned` branch. However, it is not yet merged since the code is quite difficult to follow and I'm still trying to find an elegant way to refactor it. Note that pulling this branch also requires pulling the latest version of the `permutohedral_encoding` package.

I hope no other performance regressions have crept in due to refactoring, so I will keep the issue open until I double-check everything and also merge the async branch.
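As a quick way to verify the first point, a common pattern for detecting NVIDIA Apex at runtime is a guarded import; this is only an illustrative sketch (permuto_sdf's actual check may differ), using the `has_apex` flag name from this thread:

```python
# Illustrative sketch: detect whether NVIDIA Apex is importable.
# permuto_sdf's actual implementation may differ; `has_apex` is the
# flag name mentioned in this thread.
try:
    import apex  # noqa: F401  # Apex provides fused optimizers/kernels
    has_apex = True
except ImportError:
    has_apex = False

# Inside the provided docker this should print "has_apex True";
# if it prints False, Apex is not installed correctly and training
# will be noticeably slower.
print("has_apex", has_apex)
```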
One tangential point: you can get significantly faster training by compressing the schedule using `s_mult` from here. Setting it, for example, to `0.5` halves the training time with almost no loss in accuracy for the vast majority of objects.
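To make the effect of schedule compression concrete, here is a hypothetical sketch of how a multiplier like `s_mult` could scale an iteration schedule; the function name, milestone values, and defaults are my assumptions for illustration, not permuto_sdf's actual code:

```python
# Hypothetical sketch: compress an iteration schedule by a multiplier.
# Names and milestone values are assumptions, not permuto_sdf's code.
def compress_schedule(milestones, s_mult=1.0):
    """Scale every iteration milestone by s_mult (0.5 halves the schedule)."""
    return [int(m * s_mult) for m in milestones]

# e.g. a 200k-iteration run with some decay milestones, compressed to half:
full = [100_000, 150_000, 180_000, 200_000]
print(compress_schedule(full, s_mult=0.5))  # [50000, 75000, 90000, 100000]
```

The idea is simply that every iteration-count knob in the schedule shrinks proportionally, so the training curve keeps its shape while finishing sooner.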
Hello, I checked the settings and they look right.
On the master branch, training took 28 minutes to converge on a single RTX 4090, and the 4090 should be considerably faster than the 3090.
Do I need to switch from the master branch to the `async_create_samples_cleaned` branch? Is there a difference in reconstruction accuracy between the two branches?
Hello! First of all, thanks for your excellent work and the code release! When I tried to repeat the experiments with the given docker environment on my local workstation (a single RTX 3090), I noticed that training on a single DTU scan (`dtu_scan24`) with the default hyper-parameters (200k iterations) took about 90 minutes to complete, which is much longer than the training time claimed in the paper's conclusion ($\approx$ 30 mins). Is this normal? Were the default parameters in the code different from the ones you used to measure the training time, or is there some other reason? I'd appreciate an answer.