Digital-Humans-23 / a2


Job crashed for log_std -1.0 #8

Closed. JunTu-XD closed this issue 1 year ago.

JunTu-XD commented 1 year ago

Hi TAs/folks, hope you are doing well!

I am facing an issue with exercise 2: when setting log_std to -1.0, the job keeps crashing, stopping at around global step 45, roughly 15 minutes in. I tried several times and it stopped at the same step and the same time. The log_std -2.0 and 0.0 runs are training fine so far, so I cannot think of a reasonable explanation for this. The .err output just says one line: '/usr/bin/xvfb-run: line 186: kill: (61108) - No such process'.

Thanks! Jun
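For context on that error line: xvfb-run is a wrapper that starts a virtual X server (Xvfb) so rendering works on headless nodes, runs the given command, and kills Xvfb again on exit. The 'kill: ... - No such process' message is that cleanup step failing because the Xvfb process is already gone, so it is usually a symptom of the X server dying mid-run rather than the cause. A minimal sketch of the same lifecycle done by hand; the display number and the training entry point are assumptions, not taken from this thread:

Xvfb :99 -screen 0 1280x1024x24 &       # start a virtual X display (:99 is an arbitrary choice)
XVFB_PID=$!
export DISPLAY=:99                      # point GLFW/OpenGL at the virtual display
python pylocogym/train.py               # hypothetical stand-in for the actual training command
kill "$XVFB_PID" 2>/dev/null || true    # cleanup; this is the step xvfb-run complains about if Xvfb already died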

fzargarbashi commented 1 year ago

Hi, are you running several training sessions in parallel? In that case, try again with only one experiment and see if the error persists. Also, it is always a good idea to check whether you have enough storage on the server (use lquota to see).
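For anyone running the same checks, both can be done from the login node; a short sketch (squeue is an assumption based on the thread's use of sbatch, i.e. Slurm):

lquota               # Euler's storage quota report, as suggested above
squeue -u "$USER"    # list your own queued/running jobs, to confirm only one experiment is active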

CHENGEZ commented 1 year ago

I encounter the same issue as well. I believe it's a problem with Euler. I was only running one job, yet it just gets killed for whatever reason after some time. It took me 4 trials to get log_std = -2 to complete; it is now my 4th trial for log_std = -1, it has been running for an hour, and I don't know whether it will eventually succeed. The behaviour of Euler is so unstable that it seems to depend on luck whether a job eventually completes or gets killed.

But technically shouldn't we be able to run 3 jobs at the same time? According to the Euler FAQ, even guest users can use up to 48 cores at once, yet each of our jobs only requires 16. Nevertheless, the job keeps getting killed even when it is the only job submitted, and the only fix for now seems to be to keep retrying until it doesn't get killed, which is really frustrating.
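For reference, the 16 cores per job come from the #SBATCH directives inside the job file; a hedged sketch of what such a request typically looks like (all values and the final command line are assumptions, not copied from the course's job files):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16       # the 16 cores per job mentioned above
#SBATCH --time=24:00:00          # assumed wall-clock limit
#SBATCH --output=job_%j.out      # where the training output goes
#SBATCH --error=job_%j.err       # where the xvfb-run kill message shows up
xvfb-run -s "-screen 0 1280x1024x24" python pylocogym/train.py   # hypothetical command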

fzargarbashi commented 1 year ago

I tried to reproduce the issue. It seems to be related to loading the modules. Please check if you do all of the following steps (the same commands are collected into a single sketch right after the list):

  1. ssh (with -Y) to Euler
  2. conda activate pylocoEnv
  3. module load gcc/8.2.0 python/3.9.9 cmake/3.25.0 freeglut/3.0.0 libxrandr/1.5.0 libxinerama/1.1.3 libxi/1.7.6 libxcursor/1.1.14 mesa/17.2.3 eth_proxy
  4. cd to the project main folder
  5. sbatch ./jobs/02_gaussian_reward
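The same steps typed as one interactive session (not a script: the commands after ssh are entered on Euler); the username and project path are placeholders:

ssh -Y username@euler.ethz.ch        # step 1: log in with X forwarding
conda activate pylocoEnv             # step 2: activate the environment
module load gcc/8.2.0 python/3.9.9 cmake/3.25.0 freeglut/3.0.0 libxrandr/1.5.0 libxinerama/1.1.3 libxi/1.7.6 libxcursor/1.1.14 mesa/17.2.3 eth_proxy   # step 3
cd ~/a2                              # step 4: project main folder (path assumed)
sbatch ./jobs/02_gaussian_reward     # step 5: submit the job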

When I tried without loading the modules, I got the same error; with the modules loaded, it disappears.

Please let me know if the issue is solved or not.

CHENGEZ commented 1 year ago

Hi,

As I mentioned in the previous comment, retrying sometimes fixes the problem. In fact, my 4th trial of log_std = -1 is still running (for 5 hours up till now) and hasn't crashed yet. I am pretty sure my commands are the same each time I retry, because I am a lazy person and simply put all the commands needed to run a job in a RunJob0X.sh script. Below is RunJob02.sh as an example:

#!/bin/bash
source ~/.bashrc                 # make conda available in this shell
conda activate pylocoEnv
cd build
env2lmod                         # switch to the new Euler software stack
module load gcc/8.2.0 python/3.9.9 cmake/3.25.0 freeglut/3.0.0 libxrandr/1.5.0 libxinerama/1.1.3 libxi/1.7.6 libxcursor/1.1.14 mesa/17.2.3 eth_proxy
cmake -DPython_EXECUTABLE=/cluster/home/chengyi/miniconda3/envs/pylocoEnv/bin/python3 -DCMAKE_BUILD_TYPE=Release ../
make                             # rebuild the project
cd ..
env2lmod                         # reload the same modules from the project root
module load gcc/8.2.0 python/3.9.9 cmake/3.25.0 freeglut/3.0.0 libxrandr/1.5.0 libxinerama/1.1.3 libxi/1.7.6 libxcursor/1.1.14 mesa/17.2.3 eth_proxy
sbatch ./jobs/02_gaussian_reward # submit the job

As for RunJob01.sh and RunJob03.sh, the only difference is the last line of the script (it changes to the corresponding job file).

Since I simply run the script instead of typing all the commands each time, I believe all the commands in the script are executed and each execution must be identical. However, just by running this script, I sometimes encounter the issue and sometimes not. Sometimes it works on the first try, sometimes it fixes itself after a few trials, and sometimes it just doesn't work. (In fact, up to now I still haven't managed to successfully run jobs 01 and 03, both with the same issue: there are already some training outputs (screenshot of the training log omitted), but at some point, e.g. 30 minutes after starting, the .out file says "GLFW initilization failed" and the .err file says '/usr/bin/xvfb-run: line 186: kill: (SOME_ID) - No such process'.)
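One failure mode consistent with a run that trains for a while and then hits the GLFW error plus the xvfb-run kill message (an assumption, not confirmed in this thread) is two jobs landing on the same node and colliding on the same X display number. xvfb-run's auto-servernum flag side-steps that; a sketch, with the training command again a hypothetical stand-in:

xvfb-run -a -s "-screen 0 1280x1024x24" python pylocogym/train.py
# -a (--auto-servernum): probe for a free display number instead of the default :99,
#   so two jobs scheduled onto the same node cannot clash on the display
# -s (--server-args): screen geometry handed through to Xvfb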

MiguelZamoraM commented 1 year ago

@CHENGEZ At a certain point during testing, a member of our group encountered a similar issue, and the "solution" was to use a virtual environment. Given that you were having problems with conda before, would you mind trying the virtual environment instead?

JunTu-XD commented 1 year ago

> I tried to reproduce the issue. It seems to be related to loading the modules. Please check if you do all of the following steps: [...] Please let me know if the issue is solved or not.

Thanks for the suggestion! I definitely do this before each run, and that's the weird part: I ran 3 jobs in parallel, but only the Gaussian run with log_std -1 was killed. Later I tried it again after the rest of the jobs had finished, and it was still killed at the same point, around 15 min in. Anyway, I ran it again this evening and we will see tomorrow; I will keep you updated. Luckily, it has now been alive for over 5 hours.

JunTu-XD commented 1 year ago

> @CHENGEZ At a certain point during testing, a member of our group encountered a similar issue, and the "solution" was to use a virtual environment. [...]

Actually, I am using the venv and still facing this issue, so that might not be the root cause; maybe it still comes down to luck.

CHENGEZ commented 1 year ago

> @CHENGEZ At a certain point during testing, a member of our group encountered a similar issue, and the "solution" was to use a virtual environment. [...]

Thank you! I will certainly try a virtual env if my current run fails again. I stuck with conda for now because the previous conda issue was fixed by adding source ~/.bashrc to the job script. But according to @JunTu-XD, using a virtual env doesn't really solve the current issue either. Personally I think that makes sense: if it were an environment issue, the job should have crashed immediately on start, since that is when all the import statements are executed. Therefore I still believe it's related to how Euler handles its resources and kills jobs.

By the way, @JunTu-XD, my log_std = -1 run has just finished successfully, and I am currently running log_std = 0. So maybe it's not related to the log_std value after all.

JunTu-XD commented 1 year ago

Mine is still alive too. I think there might be some issue associated with a particular compute node; anyway, that is probably out of our scope. If mine succeeds, I will close this issue tomorrow.
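If a specific compute node really were the culprit (just the suspicion voiced above), Slurm can steer around it; the job id and node name below are placeholders:

sacct -j 12345678 --format=JobID,NodeList,State,Elapsed   # check which node a crashed job ran on
sbatch --exclude=eu-g1-001 ./jobs/02_gaussian_reward      # resubmit, skipping the suspect node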

xuexianlim commented 1 year ago

/usr/bin/xvfb-run: line 186: kill: (122388) - No such process

I too got this error with the default log_std. I will try again.

xiyichen commented 1 year ago

> 3. module load gcc/8.2.0 python/3.9.9 cmake/3.25.0 freeglut/3.0.0 libxrandr/1.5.0 libxinerama/1.1.3 libxi/1.7.6 libxcursor/1.1.14 mesa/17.2.3 eth_proxy

I got the same error.