This may be more a question about whether others have had a similar experience than an actual bug report.
I've been experiencing hangs (not sure whether they come from the GPU, the dataloader, or something else) while finetuning a pre-trained model on, e.g., NLVR2.
It usually goes one of two ways:
(1) It hangs on the first iteration of the first epoch and never proceeds.
(2) It hangs at iteration n, where n is some multiple of the number of workers set in the starting script, and never proceeds.
When it hangs, CPU / GPU utilization drops to zero and the system seems to be doing nothing.
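In case it helps narrow this down, here is a minimal sketch of how the hang location could be captured, using only Python's standard `faulthandler` module. The helper name `run_with_watchdog`, its parameters, and the timeout value are hypothetical and not part of the training script; the idea is just to wrap one training iteration so that a stack dump is written if it stalls.

```python
import faulthandler
import sys


def run_with_watchdog(step_fn, timeout_s=600.0, file=sys.stderr):
    """Run step_fn(); if it takes longer than timeout_s, dump thread tracebacks.

    Hypothetical helper for illustration, not part of the training script.
    `file` must be a real file object (it needs a file descriptor).
    """
    # Arm the watchdog: after timeout_s seconds, faulthandler writes the
    # traceback of every thread in this process to `file`. With exit=False
    # the process keeps running, so a slow-but-alive step is not killed.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
    try:
        return step_fn()
    finally:
        # Disarm once the step returns so normal iterations stay silent.
        faulthandler.cancel_dump_traceback_later()
```

If a wrapped iteration stalls, the dump should show whether the main process is blocked waiting on the dataloader or inside a GPU call. It only covers threads of the main process, not the dataloader worker processes, so it would be a first step rather than a full diagnosis.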
Have you had a similar experience? If so, any advice on how to work around it?
Thanks!