daerduoCarey / structurenet

StructureNet: Hierarchical Graph Networks for 3D Shape Generation
https://cs.stanford.edu/~kaichun/structurenet/

Long training time + initial increase in KL divergence #6

Closed djr2015 closed 3 years ago

djr2015 commented 4 years ago

Thank you for your work and accompanying codebase!

1) I am able to run scripts/box_vae_chair.sh, but training is taking far longer (~1 h per epoch, so ~8.3 days for 200 epochs) than the 1-2 days your paper mentions for bounding-box inputs.

2) Over my first ~10 epochs of training, the KL divergence has been increasing; is this expected behavior? (KL-divergence plot attached.)

daerduoCarey commented 4 years ago
  1. StructureNet indeed needs a long training time, since it runs mostly on the CPU (the per-node operations are hard to batch). Make sure you have a decent CPU (at least an i7/i9); it should reach reasonable progress, close to convergence, after 1-3 days of training.
  2. The KL divergence will go up first, and then act as a regularizer later in training; see the sketch below.
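
For reference, a minimal sketch of the standard Gaussian-VAE KL term under discussion; variable names here are illustrative, not the actual StructureNet code:

```python
import torch

def kl_divergence(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # KL( N(mu, diag(sigma^2)) || N(0, I) ): summed over latent
    # dimensions, then averaged over the batch.
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

# Schematic VAE objective: the KL term typically grows early on while
# the posterior drifts away from the prior to fit reconstructions, then
# acts as the regularizer pulling the latents back toward N(0, I).
# loss = recon_loss + kl_weight * kl_divergence(mu, logvar)
```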
Warren-swr commented 1 year ago

Hello, I also ran into the problem of training taking too long. What can I do to speed it up? I noticed that torch.set_num_threads() can be set in the training script. If I have a multi-core CPU, say 20 cores, can I speed up training by using more CPU threads? I saw your comment advising against using too many CPU threads, but does that advice also apply to CPUs with more cores?
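
For anyone experimenting with this, a minimal sketch of capping the CPU thread count in a standard PyTorch setup; the specific value is hypothetical and worth benchmarking on your own machine:

```python
import torch

# Cap PyTorch's intra-op CPU parallelism before training starts.
# StructureNet runs many small per-node ops, which can suffer from
# thread oversubscription, so benchmark a few settings (e.g. 4, 8, 16)
# rather than defaulting to all 20 cores.
torch.set_num_threads(8)        # hypothetical value; tune empirically
print(torch.get_num_threads())  # confirm the setting took effect
```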