Open leonwu0108 opened 1 year ago
Hi there!
Thanks a lot for raising this issue! 90 minutes is definitely way too long and indicates a problem somewhere.
There are a couple of things that may have contributed to the long training time:

- Check that `has_apex` is `True`. Apex significantly speeds up training and should be set up automatically with the provided docker, but it's best to double-check.
- Run with `--no_viewer`. The 3D viewer hooks onto the training loop and renders at every frame, which can also slow things down.
- The `with_tensorboard: true` flag in `train_permuto_sdf.cfg` can also slow things down.
- The `--with_mask` flag.
- The `async_create_samples_cleaned` branch. However, it is not yet merged since the code is quite difficult to follow and I'm still trying to find an elegant way to refactor it. Note that pulling this branch also requires pulling the latest version of the `permutohedral_encoding` package.

I hope no other performance regressions have crept in due to refactoring, so I will keep the issue open until I double-check everything and also merge the async branch.
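As a quick way to verify the first point, a common pattern for detecting NVIDIA Apex at runtime is a guarded import; this is only an illustrative sketch (permuto_sdf's actual check may differ), using the `has_apex` flag name from this thread:

```python
# Illustrative sketch: detect whether NVIDIA Apex is importable.
# permuto_sdf's actual implementation may differ; `has_apex` is the
# flag name mentioned in this thread.
try:
    import apex  # noqa: F401  # Apex provides fused optimizers/kernels
    has_apex = True
except ImportError:
    has_apex = False

# Inside the provided docker this should print "has_apex True";
# if it prints False, Apex is not installed correctly and training
# will be noticeably slower.
print("has_apex", has_apex)
```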
One tangential point: you can get significantly faster training by compressing the schedule using `s_mult` from here. Setting it, for example, to `0.5` halves the training time with almost no loss in accuracy for the vast majority of objects.
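To make the effect of schedule compression concrete, here is a hypothetical sketch of how a multiplier like `s_mult` could scale an iteration schedule; the function name, milestone values, and defaults are my assumptions for illustration, not permuto_sdf's actual code:

```python
# Hypothetical sketch: compress an iteration schedule by a multiplier.
# Names and milestone values are assumptions, not permuto_sdf's code.
def compress_schedule(milestones, s_mult=1.0):
    """Scale every iteration milestone by s_mult (0.5 halves the schedule)."""
    return [int(m * s_mult) for m in milestones]

# e.g. a 200k-iteration run with some decay milestones, compressed to half:
full = [100_000, 150_000, 180_000, 200_000]
print(compress_schedule(full, s_mult=0.5))  # [50000, 75000, 90000, 100000]
```

The idea is simply that every iteration-count knob in the schedule shrinks proportionally, so the training curve keeps its shape while finishing sooner.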
Hello, I checked the settings and they look right.
On the master branch, training took 28 minutes to converge on a single RTX 4090, and the 4090 should be considerably faster than the 3090.
Do I need to switch from the master branch to the `async_create_samples_cleaned` branch? Is there a difference in reconstruction accuracy between the two branches?
Hello! First of all, thanks for your excellent work and the code release! When I tried to repeat the experiments with the given docker environment on my local workstation (a single RTX 3090), I noticed that training on a single DTU scan (`dtu_scan24`) with the default hyper-parameters (200k iterations) took about 90 minutes to complete, which is much longer than the training time claimed in the paper's conclusion ($\approx$ 30 mins). Is this normal? Were the default parameters in the code different from the ones you used to measure the training time, or is there some other reason? I'd appreciate an answer.