Closed: icekang closed this issue 5 months ago.
It was a misunderstanding. I was monitoring the log from SLURM, but the output of the nnUNet training was redirected to a text file by the main process (which I think SLURM did not capture, since it was tracking a different output stream).
I am having exactly the same problem; could you please clarify what solved the issue? Thanks.
In a cluster environment the reported output (stdout) is often incomplete. This is why we have the training log files. Please just look at those and the progress.png plots to assess whether a job is running correctly.
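For example, one way to follow the current training log from a login node is sketched below; the results path and log file name are placeholders that depend on your `nnUNet_results` setting, dataset, trainer, configuration, and fold:

```bash
# Illustrative only: adjust dataset name, trainer/plans/configuration, and fold
# to match your run. nnU-Net v2 writes a training_log_*.txt per training start.
RESULTS_DIR="$nnUNet_results/Dataset001_Example/nnUNetTrainer__nnUNetPlans__3d_fullres/fold_0"

# Follow the training log instead of the SLURM stdout file
tail -f "$RESULTS_DIR"/training_log_*.txt

# progress.png in the same folder is updated as epochs complete
ls -l "$RESULTS_DIR"/progress.png
```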
Thank you for your prompt response. The problem is that the training gets stuck at the following lines, and nothing changes in the log or progress.png files:
2024-05-06 22:26:35.217165: unpacking done...
2024-05-06 22:26:35.218131: do_dummy_2d_data_aug: False
2024-05-06 22:26:35.286218: Unable to plot network architecture:
2024-05-06 22:26:35.286608: No module named 'hiddenlayer'
2024-05-06 22:26:35.416393:
2024-05-06 22:26:35.416793: Epoch 0
2024-05-06 22:26:35.417209: Current learning rate: 0.01
using pin_memory on device 0
However, when using interactive GPUs, the process runs fine and the model trains.
I got a very strange issue where it took forever to unpack the dataset (I checked, and all the data had been extracted; something was holding it back). When I ran an interactive job on SLURM, this did not happen and it proceeded to the training process.
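One way to check whether unpacking actually finished is to look in the preprocessed folder for the .npy files that nnU-Net v2 creates when it unpacks the .npz archives; the dataset and plans/configuration names below are placeholders, not values from this thread:

```bash
# Illustrative check, assuming the default nnU-Net v2 layout: unpacking converts
# the preprocessed .npz archives into .npy files inside the configuration folder.
PREP_DIR="$nnUNet_preprocessed/Dataset001_Example/nnUNetPlans_3d_fullres"

# Once unpacking is done, the .npz archives should have matching .npy files
ls "$PREP_DIR"/*.npz | wc -l
ls "$PREP_DIR"/*.npy | wc -l
```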
When I submit the job:
When I run interactively:
The submission script I used for both the batch submission and the interactive run:
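The original script is not included above. For reference, a minimal illustrative SLURM submission script for an nnU-Net v2 training run could look like the sketch below; the partition, resource requests, environment setup, paths, dataset ID, configuration, and fold are assumptions, not values from the original post:

```bash
#!/bin/bash
#SBATCH --job-name=nnunet_train
#SBATCH --partition=gpu            # placeholder partition name
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=12
#SBATCH --mem=32G
#SBATCH --time=24:00:00
#SBATCH --output=nnunet_%j.out     # SLURM stdout; often lags behind the training log

# Environment setup is site-specific; a conda environment is assumed here
source activate nnunet

# Paths nnU-Net expects; adjust to your setup
export nnUNet_raw=/path/to/nnUNet_raw
export nnUNet_preprocessed=/path/to/nnUNet_preprocessed
export nnUNet_results=/path/to/nnUNet_results

# Dataset ID, configuration, and fold are placeholders
nnUNetv2_train 1 3d_fullres 0
```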