Hi all,
I have the same issue on my side. The main point is that idle shutdown activates while a job is still running, so I don't think it's directly linked to PyTorch. Were you able to find a solution to avoid this?
I was also using tmux on my side. The overall steps to reproduce are quite the same as described by @monajalal.
Many thanks
I ran training code inside an Azure Compute node with 4 GPUs, where the input data was read from Azure DataStore blob storage and the output data was written to the same blob storage.
My code ran for 11 epochs and then stopped; since the node was then idle for 1 hour, it shut down because I had enabled idle shutdown.
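For context, reading from the datastore can be set up roughly like below (a minimal sketch using the azureml-core v1 SDK; `workspaceblobstore` and the path are illustrative placeholders, not my exact configuration):

```python
from azureml.core import Workspace, Datastore, Dataset

# Load the workspace from config.json present on the compute node.
ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")  # placeholder name

# Build a file dataset from a folder in the blob store and mount it
# locally, so the training script can read it like a normal directory.
dataset = Dataset.File.from_files(path=(datastore, "training-data/**"))
mount_ctx = dataset.mount()
mount_ctx.start()
print("data mounted at:", mount_ctx.mount_point)
# ... run training here, then mount_ctx.stop() when done ...
```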
Since I ran the code inside a tmux session, and I expected it to run for 60 epochs, I cannot identify the source of the error. Is there any way you could help me figure out the source of the failure?
I am running the exact same training code with the same data on a local server inside a tmux session and have no problems there; it is on epoch 30 now.
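Since any traceback printed inside a detached tmux pane is lost once the node shuts down, something like the following could preserve the failure evidence on disk (a minimal sketch; the file names are illustrative, not my actual training script):

```python
import faulthandler
import logging
import signal

# Send ordinary progress messages to a file instead of the tmux pane.
logging.basicConfig(
    filename="train.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

# Dump Python tracebacks to disk even on hard failures (segfault,
# SIGTERM from the platform) so the evidence survives the session.
crash_log = open("crash.log", "w")
faulthandler.enable(file=crash_log)
faulthandler.register(signal.SIGTERM, file=crash_log)

logging.info("starting training loop")
# ... epoch loop goes here ...
```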
Here are some of the messages I see in the tail of dmesg:
and here are the GPUs:
My code uses PyTorch's DataParallel as the training mechanism, and I utilized all 4 of the available GPUs.
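The multi-GPU setup is along these lines (a minimal sketch, not my actual model; `Net` and the tensor shapes are placeholders):

```python
import torch
import torch.nn as nn

# Placeholder model; the real training code is more involved.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Net()
if torch.cuda.device_count() > 1:
    # DataParallel scatters each input batch across all visible GPUs
    # (4 here) and gathers the outputs back on the default device.
    model = nn.DataParallel(model)
model.to(device)

batch = torch.randn(64, 128, device=device)
out = model(batch)   # the batch of 64 is split across the GPUs
print(out.shape)     # torch.Size([64, 10])
```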
Expected behavior: To be able to figure out how to debug the root cause of the failure in training.