archinetai / audio-diffusion-pytorch-trainer

Trainer for audio-diffusion-pytorch
MIT License

Improving training speeds #6

Open Ericxgao opened 1 year ago

Ericxgao commented 1 year ago

I'm noticing that while each epoch completes rather quickly, there's a lot of time in between each one. Any ideas on how to improve this and find out where the bottlenecks are?
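Not something mentioned in the thread, but one low-effort way to see where the time goes is PyTorch Lightning's built-in profiler; a minimal sketch (the other Trainer arguments are placeholders, not this repo's settings):

```python
import pytorch_lightning as pl

# profiler="simple" prints per-hook timings (data loading, training step,
# optimizer step, ...) when training ends, which helps tell dataloading
# stalls apart from slow forward/backward passes.
trainer = pl.Trainer(
    max_epochs=1,      # placeholder; use your normal settings
    profiler="simple",
)
```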

Ericxgao commented 1 year ago

Two things that helped on an 8xA100 setup:

Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters.

In response to this warning, I added strategy: ddp_find_unused_parameters_false to the trainer section of the config file.
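For reference, a minimal sketch of the equivalent Trainer call in PyTorch Lightning 1.x (the device count here is an assumption, not this repo's exact config):

```python
import pytorch_lightning as pl

# Same effect as putting `strategy: ddp_find_unused_parameters_false` under
# the `trainer:` section of the config: DDP skips the extra autograd-graph
# traversal it performs when find_unused_parameters=True.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp_find_unused_parameters_false",
)
```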

UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)

In response to this warning, I set persistent_workers=True in the DataLoader constructors in the Datamodule class.
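A sketch of what that change looks like inside a LightningDataModule (attribute names are placeholders, not the repo's actual fields):

```python
from torch.utils.data import DataLoader

def train_dataloader(self):
    # persistent_workers keeps worker processes alive across epochs instead of
    # respawning them at every epoch boundary, which is a common source of the
    # long pauses between epochs. It requires num_workers > 0.
    return DataLoader(
        self.dataset,
        batch_size=self.batch_size,
        num_workers=self.num_workers,
        pin_memory=True,
        persistent_workers=True,
    )
```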

flavioschneider commented 1 year ago

Thanks for following up with what worked for you. One thing that can make some iterations very slow is uncropped audio where some files are much longer than the training crop, e.g. you are training on 30s samples but some files are 10min long. Other than that I'm not sure.
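Not from the thread, but a sketch of the kind of fixed-length random crop that avoids pushing full-length files through the pipeline (the helper name and sample rate are hypothetical):

```python
import torch
import torch.nn.functional as F

def random_crop(waveform: torch.Tensor, crop_samples: int) -> torch.Tensor:
    # waveform: [channels, samples]; returns a random window of crop_samples,
    # zero-padding files that are shorter than the crop length.
    total = waveform.shape[-1]
    if total <= crop_samples:
        return F.pad(waveform, (0, crop_samples - total))
    start = torch.randint(0, total - crop_samples + 1, (1,)).item()
    return waveform[..., start : start + crop_samples]

# e.g. 30 s crops at 48 kHz: random_crop(waveform, 30 * 48000)
```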