Open mxochicale opened 1 year ago
Hi @harveymannering
It seems we are getting an error, but I'm not sure what the issue is. I'm putting the reference here so we can hopefully sort this one out.
> Eqw: there was an error in this jobscript. This will not run.
https://www.rc.ucl.ac.uk/docs/howto/#job-states
Thanks, Miguel
I'm getting an Eqw state as well. I will try again having created the ~/runLog directory.
$ qstat -explain E -j 2266220
==============================================================
job_number: 2266220
exec_file: job_scripts/2266220
submission_time: Fri Jul 28 16:03:08 2023
owner: cceajes
uid: 280582
group: cceas2
gid: 70037
sge_o_home: /home/cceajes
sge_o_log_name: cceajes
sge_o_path: /shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/bin:/shared/ucl/apps/cuda/11.3.1/gnu-10.2.0/bin:/shared/ucl/apps/cuda/11.3.1/gnu-10.2.0/nsight-compute-2021.1.1:/home/cceajes/.python3local/bin:/shared/ucl/apps/python/bundles/gnu-10.2.0/python39-6.0.0/venv/bin:/shared/ucl/apps/openblas/0.3.13-serial/gnu-10.2.0/bin:/shared/ucl/apps/python/3.9.6/gnu-10.2.0/bin:/shared/ucl/apps/gcc/10.2.0-p95889/bin:/home/cceajes/miniconda3/bin:/home/cceajes/miniconda3/condabin:/shared/ucl/apps/cluster-bin:/shared/ucl/apps/cluster-scripts:/shared/ucl/apps/mrxvt/0.5.4/bin:/shared/ucl/apps/tmux/3.3a/bin:/shared/ucl/apps/emacs/28.1/bin:/shared/ucl/apps/giflib/5.1.1/gnu-4.9.2/bin:/shared/ucl/apps/dos2unix/7.3/gnu-4.9.2/bin:/shared/ucl/apps/NEdit/5.6-Aug15/bin:/shared/ucl/apps/nano/2.4.2/gnu-4.9.2/bin:/shared/ucl/apps/GERun:/shared/ucl/apps/screen/4.9.0/bin:/shared/ucl/apps/subversion/1.14.1/bin:/shared/ucl/apps/apr-util/1.6.1/bin:/shared/ucl/apps/apr/1.7.0/bin:/shared/ucl/apps/git/2.32.0/gnu-4.9.2/bin:/shared/ucl/apps/flex/2.5.39/gnu-4.9.2/bin:/shared/ucl/apps/cmake/3.21.1/gnu-4.9.2/bin:/opt/sge/bin:/opt/sge/bin/lx-amd64:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/home/cceajes/.local/bin:/home/cceajes/bin
sge_o_shell: /bin/bash
sge_o_workdir: /lustre/home/cceajes/text_to_image/TextToImageModels/dreambooth
sge_o_host: login13
account: policyjsv;F=1;J=1;B=0;E=1;H=0;D=0;I=0;L=1
reserve: y
merge: y
hard resource_list: batch=true,gpu=1,h_rt=3600,memory=8G,snx=1
mail_list: cceajes@login13.myriad.ucl.ac.uk
notify: FALSE
job_name: Dreambooth
stdout_path_list: NONE:NONE:~/runLog/
jobshare: 0
restart: n
shell_list: NONE:/bin/bash
env_list: TERM=NONE,XAUTHORITY=/scratch/scratch/cceajes/.Xauthority,PAID=0,GPU=1,SGE_UCL_MEM=8589934592,MICCOUNT=0,SCRATCH_SPACE=10737418240,SGE_ONE=1,SGE_SHARENODE=1,IFS=
script_file: finetuning.qsub.sh
parallel environment: smp-[FJEL]* range: 1
project: AllUsers
binding: set linear:slots
job_type: NONE
error reason 1: 07/28/2023 21:05:00 [280582:177077]: error: can't open output file "/home/cceajes/runLog/": Is a directory
scheduling info: (Collecting of scheduler job information is turned off)
Did you guys try the qexplain command? I'm not sure why this would be failing, but I would be curious to see the explanation given by qexplain.
It was caused by the /home/cceajes/runLog/ directory not existing. qexplain looks to be a wrapper around qstat -j, so it would give something very similar to the output above.
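To pull just the failure reason out of a long qstat dump like the one above, a small filter can help. This is a minimal sketch: the `error_reasons` function name is made up for illustration, and it assumes the reason lines always start with "error reason", as they do in the output pasted in this thread.

```shell
#!/bin/bash
# Hypothetical helper: keep only the "error reason" lines from qstat output on stdin.
error_reasons() {
    grep '^error reason' || true   # "|| true" so an empty result is not treated as a failure
}

# On the cluster this would be used as:
#   qstat -explain E -j 2266220 | error_reasons
# Offline demo, using two lines copied from the output above:
reasons=$(error_reasons <<'EOF'
job_number: 2266220
error reason 1: 07/28/2023 21:05:00 [280582:177077]: error: can't open output file "/home/cceajes/runLog/": Is a directory
EOF
)
echo "$reasons"
```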
I'm surprised qsub doesn't create the runLog folder itself. I suppose you could just run mkdir runLog yourself in the appropriate directory. It doesn't have any subdirectories, so that should stop the error.
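The workaround could be sketched as a small wrapper that creates the log directory before submitting. The `LOG_DIR` variable and `ensure_log_dir` function are illustrative names only; the path assumes the log directory is `~/runLog`, as in the qstat output above, and the script name is taken from the `script_file` field there.

```shell
#!/bin/bash
# Assumption: logs go to ~/runLog, matching stdout_path_list in the qstat output.
LOG_DIR="${LOG_DIR:-$HOME/runLog}"

# Hypothetical helper: create the directory if it is missing (no error if it already exists).
ensure_log_dir() {
    mkdir -p "$1"
}

ensure_log_dir "$LOG_DIR"

# Submit only where qsub is available (i.e. on a cluster login node).
if command -v qsub >/dev/null 2>&1; then
    qsub finetuning.qsub.sh
else
    echo "qsub not found; created $LOG_DIR only"
fi
```

Using `mkdir -p` means the wrapper can be rerun safely once the directory exists.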
Just raising this one to track my implementation.
Few comments from the implementation
Logs for qstat