e-bug / volta

[TACL 2021] Code and data for the framework in "Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs"
https://aclanthology.org/2021.tacl-1.58/
MIT License

Experienced system hanging? #12

Closed: juliuswang0728 closed this issue 2 years ago

juliuswang0728 commented 2 years ago

Hi!

This is more a question about whether anyone has had a similar experience than a real bug report.

I've been experiencing system hangs (I'm not sure whether they come from the GPU, the dataloader, or something else) while fine-tuning a pre-trained model on, e.g., NLVR2. It usually goes one of two ways: (1) it hangs at the first iteration of the first epoch and never proceeds, or (2) it hangs at iteration n, where n is some multiple of the number of workers set in the starting script, and never proceeds.

When it hangs, CPU and GPU utilization drop to zero and the system seems to be doing nothing. Have you had a similar experience? If so, any advice on how to work around it? Thanks!

e-bug commented 2 years ago

Hi Julius,

I have never experienced this with VOLTA.

But I did have it with another repository I used, and the hanging would get better as I trained. I'm not sure what might cause it, though.
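
Since the cause is unclear, one way to narrow it down might be to check whether the process is stuck while fetching the next batch. The sketch below is only an illustration and is not part of VOLTA: `watched` is a hypothetical helper that wraps any iterable and uses Python's standard `faulthandler` module to dump the stack traces of all threads if fetching the next item takes longer than a timeout. It only watches the time spent waiting for data, so if nothing is dumped during a hang, the stall is more likely in the training step itself.

```python
import faulthandler
import sys


def watched(iterable, timeout_s=300.0, file=None):
    """Yield items from `iterable` (hypothetical helper, not part of VOLTA).

    If fetching the next item takes longer than `timeout_s` seconds, the
    stack traces of all threads are written to `file` (stderr by default).
    """
    if file is None:
        file = sys.stderr
    iterator = iter(iterable)
    while True:
        # Arm the watchdog: the dump fires only if next() is still blocked
        # after timeout_s seconds. exit=False keeps the process alive.
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
        try:
            item = next(iterator)
        except StopIteration:
            return
        finally:
            # Disarm before handing the item to the caller, so time spent
            # in the caller's loop body is not counted.
            faulthandler.cancel_dump_traceback_later()
        yield item
```

Assuming the training loop iterates over a dataloader (called `train_dataloader` here purely for illustration), it could be wrapped as `for batch in watched(train_dataloader, timeout_s=600): ...`. A dump pointing into worker or queue code would suggest the dataloader; the timeout value is arbitrary and should be longer than a normal batch fetch.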