Audio-WestlakeU / FN-SSL

The Official PyTorch Implementation of FN-SSL & IPDnet for Sound Source Localization

Addressing Bottlenecks in Training #4

Closed YuseonChoi closed 4 weeks ago

YuseonChoi commented 2 months ago

I am using two RTX 3090 GPUs to run train.py following the provided guide.

I used only one FullNarrowBlock, but training took 5 hours to complete just one of the 15 epochs, which is much longer than I expected.

When I checked the GPU utilization, it seemed there was a bottleneck somewhere in the code. I suspect it might be in the data loading and processing part.

I am wondering if this is normal. If something is wrong, could you give me some advice on how to deal with it?
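
As a rough check, something like the following sketch can show whether data loading alone is slow (it uses a stand-in dataset; in practice the dataset constructed in train.py would be substituted):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; substitute the dataset built in train.py.
dummy = TensorDataset(torch.randn(1000, 4, 16000), torch.randn(1000, 2))
loader = DataLoader(dummy, batch_size=16, num_workers=8, pin_memory=True)

# Time pure data loading (no forward/backward pass). If this alone is slow,
# the bottleneck is in loading/preprocessing rather than on the GPUs.
start = time.time()
n_batches = 0
for batch in loader:
    n_batches += 1
    if n_batches == 50:
        break
print(f"avg time per batch (data only): {(time.time() - start) / n_batches:.3f} s")
```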

wangyabo123 commented 2 months ago

Sorry for the late response. I recommend using the lightning version of the code for training. It is faster than the torch version. Using mixed precision in the lightning version will further speed up the training. Regarding the bottleneck issue you mentioned, one possible reason is that we are saving too many attributes when storing the simulation data (Dataset.AcousticScene), which might cause an I/O bottleneck during training. You can try removing unnecessary attributes when saving the simulation data.
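
As an illustration only (the attribute names below are hypothetical, not the actual fields of Dataset.AcousticScene, and it assumes the scenes are pickled to disk), pruning a scene object before saving might look like this:

```python
import pickle

def save_acoustic_scene(scene, path, keep=("mic_signals", "DOA", "fs")):
    # `keep` lists hypothetical attribute names; replace them with the fields
    # your training pipeline actually reads from Dataset.AcousticScene.
    for name in list(vars(scene)):
        if name not in keep:
            delattr(scene, name)
    # Smaller objects on disk mean less I/O per sample during training.
    with open(path, "wb") as f:
        pickle.dump(scene, f)
```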

wangyabo123 commented 2 months ago

In the lightning version, you can add `--trainer.precision=16-mixed` to the train command to enable mixed precision training:

```
python main.py fit --data.batch_size=[*,*] --trainer.devices=*,* (--trainer.precision=16-mixed)
```
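
Equivalently, if you build the Trainer yourself instead of going through the CLI, mixed precision can be set directly (a minimal sketch using the standard PyTorch Lightning API; the model and datamodule are placeholders for the objects defined in the repo's lightning version):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=[0, 1],          # two GPUs
    precision="16-mixed",    # fp16 autocast with fp32 master weights
    max_epochs=15,
)
# trainer.fit(model, datamodule=dm)  # model / datamodule come from the repo's lightning code
```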