facebookresearch / AVT

Code release for ICCV 2021 paper "Anticipative Video Transformer"
Apache License 2.0

Job exiting automatically after a while #26

Closed. Anirudh257 closed this issue 2 years ago

Anirudh257 commented 2 years ago

I am training end-to-end on the Epic-55 dataset, using only the videos. I followed all the steps in the repository and was able to set up the training. The model trains for a while, but then the job gets canceled automatically.

I get this error:

[Screenshot of the error message: MicrosoftTeams-image (2)]

The batch file I submitted is:

#SBATCH --cpus-per-task=10
#SBATCH --error=/scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_0_log.err
#SBATCH --gres=gpu:v100:4,nvme:100
#SBATCH --job-name=AVT
#SBATCH --mem=200GB
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --open-mode=append
#SBATCH --output=/scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_0_log.out
#SBATCH --partition=gpu
#SBATCH --signal=USR1@120
#SBATCH --time=4320
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --output /scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_%t_log.out --error /scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j/%j_%t_log.err --unbuffered /scratch/project_2000255/anwer/antic_trans/AVT/avt_env/bin/python -u -m submitit.core._submit /scratch/project_2000255/anwer/antic_trans/AVT/OUTPUTS/expts/09_ek55_avt.txt/.submitit/%j

Could this be caused by my node count differing from the one specified in the config? I don't have 4 nodes of 8 GPUs each.
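To make the size of the mismatch concrete, here is a minimal Python sketch that reads the relevant `#SBATCH` directives from the script above and compares the resulting number of processes with the 4 nodes of 8 GPUs mentioned in the question. The helper names (`parse_sbatch`, `world_size`, `gpus_per_node`) are hypothetical and not part of AVT or submitit, and the sketch assumes one process per GPU. It only quantifies the difference; whether that difference is what causes the job to exit is not established here.

```python
import re

# Directives copied from the batch script above (only the ones needed here).
SBATCH_TEXT = """\
#SBATCH --gres=gpu:v100:4,nvme:100
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
"""


def parse_sbatch(text):
    """Collect `#SBATCH --key=value` directives into a dict (hypothetical helper)."""
    opts = {}
    for line in text.splitlines():
        match = re.match(r"#SBATCH\s+--([\w-]+)=(.+)", line.strip())
        if match:
            opts[match.group(1)] = match.group(2).strip()
    return opts


def world_size(opts):
    """Total number of tasks Slurm launches: nodes * tasks per node."""
    return int(opts["nodes"]) * int(opts["ntasks-per-node"])


def gpus_per_node(opts):
    """Read the GPU count from a gres string such as 'gpu:v100:4,nvme:100'."""
    for item in opts.get("gres", "").split(","):
        parts = item.split(":")
        if parts[0] == "gpu":
            return int(parts[-1])
    return 0


opts = parse_sbatch(SBATCH_TEXT)
actual = world_size(opts)

# Assumption: the reference setup is the 4 nodes x 8 GPUs from the question,
# with one process per GPU.
reference = 4 * 8

print("GPUs per node requested:", gpus_per_node(opts))
print("processes in this job:", actual)
print("processes in the reference setup:", reference)
print("ratio:", actual / reference)
```

With these directives the job runs 8 processes against 32 in the reference setup. Any setting that is implicitly tuned to the total process count, such as the effective batch size, would then differ by the same factor.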