Closed Anirudh257 closed 2 years ago
I am training end-to-end on the EPIC-Kitchens-55 (EK55) dataset, using only the videos. I followed all the steps in the repository and was able to set up the training. My model trains for a while, but then the job gets canceled automatically.
I get this error:
My batch file submitted is:
#SBATCH --cpus-per-task=10
#SBATCH --error=/scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_0_log.err
#SBATCH --gres=gpu:v100:4,nvme:100
#SBATCH --job-name=AVT
#SBATCH --mem=200GB
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --open-mode=append
#SBATCH --output=/scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_0_log.out
#SBATCH --partition=gpu
#SBATCH --signal=USR1@120
#SBATCH --time=4320
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --output /scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_%t_log.out --error /scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_%t_log.err --unbuffered /scratch/project_2000255/anwer/antic_trans/AVT/avt_env/bin/python -u -m submitit.core._submit /scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j
Is this issue caused by my using a different number of nodes than the config specifies? I don't have 4 nodes with 8 GPUs each.
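To make the mismatch concrete, here is a minimal sketch of the arithmetic. It assumes one process per GPU (which is what `--ntasks-per-node=4` with `gpu:v100:4` suggests) and that the reference setup is the 4 nodes of 8 GPUs mentioned above. The helper `world_size` is a hypothetical name used only for illustration, and nothing here shows that the difference is what cancels the job.

```python
def world_size(nodes: int, gpus_per_node: int) -> int:
    """Total number of GPU processes, assuming one process per GPU."""
    return nodes * gpus_per_node


# Values taken from the batch file above: --nodes=2, --gres=gpu:v100:4
requested = world_size(nodes=2, gpus_per_node=4)

# Reference setup as described in the question: 4 nodes of 8 GPUs
reference = world_size(nodes=4, gpus_per_node=8)

# How many times fewer GPUs this job uses than the reference setup
ratio = reference / requested

# Slurm reads a bare number in --time as minutes
time_limit_hours = 4320 / 60

print(f"requested GPUs: {requested}")
print(f"reference GPUs: {reference}")
print(f"reference / requested: {ratio}")
print(f"time limit: {time_limit_hours} hours")
```

If the config defines a per-GPU batch size, the global batch size would shrink by the same ratio, so the learning rate might need adjusting. That depends on how the config is written, so I have not assumed it here.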