pengzhangzhi closed this issue 7 months ago
I'm running into a similar issue while training the Conformer large model
in a Docker container with the latest nvcr.io/nvidia/nemo:23.10
image, on p2.16xlarge (V100 instances). What is your training environment like?
This is my Docker container:
nvcr.io/nvidia/clara/bionemo-framework:latest "/workspace/bionemo/…" 4 days ago Up 13 hours bionemo
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
I am facing the same issue on a multi-node, multi-GPU setup without Docker. I am using Slurm to run the job.
I am using the default configs, code, and data to train a model with the BioNeMo framework. The timeout occurs in the middle of training.
The configs are: