atosystem / SpeechCLIP

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model, Accepted to IEEE SLT 2022
https://atosystem.github.io/blogs/speechclip
BSD 3-Clause "New" or "Revised" License

Training on Flickr Dataset Unexpectedly Hangs #6

Open mhamzaerol opened 8 months ago

mhamzaerol commented 8 months ago

Hello,

First of all, thank you very much for this work and your efforts! The repository and guidelines are succinct and pretty effective!

I've encountered a recurring issue while training the large parallel model on the Flickr dataset. The training process hangs unexpectedly: no updates appear in the terminal or in the wandb logs. This occurred at approximately 2.7k steps in the first run and at around 32k steps in the second. My Conda environment uses Python 3.10, and I was running the experiments on 4 A5000 GPUs.

As a workaround, whenever the training process halts, I currently resume training from the latest checkpoint using the resume flag in the training script.

I am curious if this is a known issue. Are there components in the code that might cause such behavior, particularly with my setup? Additionally, is resuming training a recommended approach, or are there other flags/settings I should consider?

Any insights or suggestions you can provide would be greatly appreciated.

Thank you!

atosystem commented 8 months ago

Hi @mhamzaerol! Could you share a screenshot of what you see "when the training process unexpectedly hangs"? Thanks

mhamzaerol commented 8 months ago

Hi,

Thank you very much for the quick response! I am attaching screenshots of:

- an instance of a run when it hangs unexpectedly
- the wandb terminal

Thank you!

atosystem commented 7 months ago

I have not encountered this kind of problem before. My guess is that it has something to do with multi-GPU training. Does the problem also occur if you use a single GPU? (I believe Parallel Base can fit on a single GPU with a small batch size.) If single-GPU training works, I suggest looking into the validation functions in the PyTorch training code.
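One possible way to see where a hung process is stuck is a small watchdog built on Python's standard `faulthandler` module. This is only a sketch and not part of SpeechCLIP: the helper names (`arm_hang_watchdog`, `disarm_hang_watchdog`, `run_with_watchdog`) and the `step_fn` callback are hypothetical stand-ins for wherever the real training step runs. The idea is to re-arm a timer before every step; if a step takes longer than the timeout, the tracebacks of all threads in that process are written out.

```python
import faulthandler
import sys
from typing import Callable, Optional, TextIO


def arm_hang_watchdog(timeout_s: float, file: Optional[TextIO] = None) -> None:
    """(Re)start a timer that dumps all thread tracebacks after timeout_s seconds.

    Calling this again before the timer fires replaces the previous timer,
    so a loop that keeps making progress never triggers a dump.
    The file must stay open (and have a file descriptor) until disarmed.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=target, exit=False)


def disarm_hang_watchdog() -> None:
    """Cancel the pending traceback dump, if any."""
    faulthandler.cancel_dump_traceback_later()


def run_with_watchdog(
    step_fn: Callable[[int], None],  # hypothetical stand-in for one training step
    num_steps: int,
    timeout_s: float,
    file: Optional[TextIO] = None,
) -> None:
    """Run step_fn for num_steps steps, re-arming the watchdog before each one."""
    try:
        for step in range(num_steps):
            arm_hang_watchdog(timeout_s, file)
            step_fn(step)
    finally:
        disarm_hang_watchdog()
```

In a multi-GPU run each rank is normally its own process, so each rank would need to arm its own watchdog (ideally writing to a per-rank file). Comparing the dumps may then show whether one rank is waiting inside validation or a collective operation while the others are elsewhere. The timeout should be chosen well above the longest legitimate step, including validation and checkpointing.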

mhamzaerol commented 7 months ago

Thank you very much for the feedback! I have actually never tried training in a single-GPU setting. I will update you if I encounter a similar issue.