Digital-Humans-23 / a2

Job crashes when computer is turned off? #12

Open kstavratis opened 1 year ago

kstavratis commented 1 year ago

I've been trying to run the provided batch file with the sbatch command, after overcoming the usual hurdles with conda environments et cetera.

Currently, I'm having an issue with Euler: the job seems to stop executing (and ends up with a "crashed" status) whenever I turn off my computer. Has this happened to anyone, and if so, how did you resolve it? I submit the job according to the instructions in the README.md:

# Before you start a job, make sure to run the following two commands, every time you start a new ssh connection to Euler.
$ env2lmod
$ module load gcc/8.2.0 python/3.9.9 cmake/3.25.0 freeglut/3.0.0 libxrandr/1.5.0 libxinerama/1.1.3 libxi/1.7.6 libxcursor/1.1.14 mesa/17.2.3 eth_proxy
# Submit job
$ sbatch ./jobs/03_bob

MiguelZamoraM commented 1 year ago

That should not be the case. The sbatch command runs on the server and the job should run even if you close your ssh session.
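
One way to confirm this after reconnecting is to ask Slurm's accounting for the job's recorded state. The following is a minimal sketch: `job_state` is a hypothetical helper name, not something from the repository, and it assumes the standard Slurm `sacct` client is available on the login node.

```shell
# Hypothetical helper: show what Slurm recorded for a job, independent of
# whether the ssh session that submitted it is still open.
job_state() {
    job_id="$1"
    # Bail out early when the Slurm client tools are not on PATH.
    if ! command -v sacct >/dev/null 2>&1; then
        echo "sacct not found; run this on a cluster login node"
        return 1
    fi
    # -X: job allocation only, -n: no header, -o: columns to print
    sacct -X -n -j "$job_id" -o JobID,State,Elapsed,Timelimit
}
```

Call it with the job ID that `sbatch` printed on submission. A `State` of `TIMEOUT` with `Elapsed` close to `Timelimit` would point to the wall-clock limit rather than to the closed ssh session.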

yolkarian commented 1 year ago

I have also experienced the same thing. Job 3 crashes after running for approx. 16h (15h59min).

The reason might be that the running-time limit of job 3 is set to 16h, and training seems to need more than 16 hours.
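
If the 16h limit is indeed the cause, one possible workaround is to resubmit with a longer wall-clock limit. This is only a sketch: `resubmit_cmd` is a hypothetical helper, the 24-hour value is illustrative rather than measured, and whether the cluster accepts a longer limit for this job is an assumption. It relies on `sbatch --time`, which on the command line overrides a `#SBATCH --time` line in the job script.

```shell
# Hypothetical helper: build (but do not run) an sbatch command with a
# longer wall-clock limit, so it can be inspected before submitting.
resubmit_cmd() {
    hours="$1"    # requested limit in whole hours (illustrative value)
    script="$2"   # path to the job script
    printf 'sbatch --time=%s:00:00 %s\n' "$hours" "$script"
}

# Print the command for a 24-hour limit.
resubmit_cmd 24 ./jobs/03_bob
```

Printing the command first keeps the change visible; once it looks right, it can be pasted into the shell on the login node.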

MiguelZamoraM commented 1 year ago

@yolkarian If your experiment timed out, see this.