dlcv-journal-club / TextToImageModels

Running my own dreambooth #3

Open mxochicale opened 1 year ago

mxochicale commented 1 year ago

Just raising this one to track my implementation.

A few comments from the implementation:

  1. qdel is useful for deleting jobs, and the meaning of the Eqw state in the qstat list is explained here: https://www.t3.gsic.titech.ac.jp/en/node/65
  2. Setting up google-colab is straightforward. It is worth trying paperspace as well.

Logs for qstat

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 979761 3.33562 SimpleMode ccaemxo      Eqw   06/02/2023 10:53:02                                    1        
2264821 0.00000 Dreambooth ccaemxo      qw    07/28/2023 15:29:42                                    1   
$ qdel 979761
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2264821 0.00000 Dreambooth ccaemxo      qw    07/28/2023 15:29:42                                    1        
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2264821 3.12906 Dreambooth ccaemxo      Eqw   07/28/2023 15:29:42                                    1        
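
To spot jobs stuck in the Eqw state without reading the whole table, the qstat output above can be filtered on its state column. This is only a minimal sketch: the function name `list_eqw_jobs` is hypothetical, and it assumes the column layout shown in the logs above (two header lines, state in the fifth whitespace-separated field, no spaces in job names).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read qstat output on stdin and print the job-IDs
# of jobs whose state column is exactly "Eqw".
# Assumes two header lines and that the state is the 5th field.
list_eqw_jobs() {
    awk 'NR > 2 && $5 == "Eqw" { print $1 }'
}

# Only query the scheduler when qstat is actually available.
if command -v qstat >/dev/null 2>&1; then
    qstat | list_eqw_jobs
fi
```

The printed IDs can be checked by hand and then passed to qdel, as was done for job 979761 above.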
mxochicale commented 1 year ago

Hi @harveymannering

It seems we are getting an error, but I'm not sure what the issue is. I'm putting the reference here in the hope that it helps sort this out:

Eqw: there was an error in this jobscript. This will not run. > https://www.rc.ucl.ac.uk/docs/howto/#job-states

Thanks, Miguel

jeremyestein commented 1 year ago

I'm getting an Eqw state as well. I will try again after creating the ~/runLog directory.

$ qstat -explain E -j 2266220
==============================================================
job_number:                 2266220
exec_file:                  job_scripts/2266220
submission_time:            Fri Jul 28 16:03:08 2023
owner:                      cceajes
uid:                        280582
group:                      cceas2
gid:                        70037
sge_o_home:                 /home/cceajes
sge_o_log_name:             cceajes
sge_o_path:                 /shared/ucl/apps/pytorch/1.11.0/python3.9.6/cuda/bin:/shared/ucl/apps/cuda/11.3.1/gnu-10.2.0/bin:/shared/ucl/apps/cuda/11.3.1/gnu-10.2.0/nsight-compute-2021.1.1:/home/cceajes/.python3local/bin:/shared/ucl/apps/python/bundles/gnu-10.2.0/python39-6.0.0/venv/bin:/shared/ucl/apps/openblas/0.3.13-serial/gnu-10.2.0/bin:/shared/ucl/apps/python/3.9.6/gnu-10.2.0/bin:/shared/ucl/apps/gcc/10.2.0-p95889/bin:/home/cceajes/miniconda3/bin:/home/cceajes/miniconda3/condabin:/shared/ucl/apps/cluster-bin:/shared/ucl/apps/cluster-scripts:/shared/ucl/apps/mrxvt/0.5.4/bin:/shared/ucl/apps/tmux/3.3a/bin:/shared/ucl/apps/emacs/28.1/bin:/shared/ucl/apps/giflib/5.1.1/gnu-4.9.2/bin:/shared/ucl/apps/dos2unix/7.3/gnu-4.9.2/bin:/shared/ucl/apps/NEdit/5.6-Aug15/bin:/shared/ucl/apps/nano/2.4.2/gnu-4.9.2/bin:/shared/ucl/apps/GERun:/shared/ucl/apps/screen/4.9.0/bin:/shared/ucl/apps/subversion/1.14.1/bin:/shared/ucl/apps/apr-util/1.6.1/bin:/shared/ucl/apps/apr/1.7.0/bin:/shared/ucl/apps/git/2.32.0/gnu-4.9.2/bin:/shared/ucl/apps/flex/2.5.39/gnu-4.9.2/bin:/shared/ucl/apps/cmake/3.21.1/gnu-4.9.2/bin:/opt/sge/bin:/opt/sge/bin/lx-amd64:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/home/cceajes/.local/bin:/home/cceajes/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /lustre/home/cceajes/text_to_image/TextToImageModels/dreambooth
sge_o_host:                 login13
account:                    policyjsv;F=1;J=1;B=0;E=1;H=0;D=0;I=0;L=1
reserve:                    y
merge:                      y
hard resource_list:         batch=true,gpu=1,h_rt=3600,memory=8G,snx=1
mail_list:                  cceajes@login13.myriad.ucl.ac.uk
notify:                     FALSE
job_name:                   Dreambooth
stdout_path_list:           NONE:NONE:~/runLog/
jobshare:                   0
restart:                    n
shell_list:                 NONE:/bin/bash
env_list:                   TERM=NONE,XAUTHORITY=/scratch/scratch/cceajes/.Xauthority,PAID=0,GPU=1,SGE_UCL_MEM=8589934592,MICCOUNT=0,SCRATCH_SPACE=10737418240,SGE_ONE=1,SGE_SHARENODE=1,IFS=
script_file:                finetuning.qsub.sh
parallel environment:  smp-[FJEL]* range: 1
project:                    AllUsers
binding:                    set linear:slots
job_type:                   NONE
error reason          1:      07/28/2023 21:05:00 [280582:177077]: error: can't open output file "/home/cceajes/runLog/": Is a directory
scheduling info:            (Collecting of scheduler job information is turned off)
harveymannering commented 1 year ago

Did you guys try the qexplain command? I'm not sure why this would be failing, but I would be curious to see the explanation given by qexplain.

jeremyestein commented 1 year ago

It was caused by the /home/cceajes/runLog/ directory not existing. qexplain looks to be a wrapper around qstat -j, so it would give very similar output to the above.
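
Since the full `qstat -explain E -j` output is long, it can help to pull out just the "error reason" lines. The following is a rough sketch, not a replacement for qexplain: the function names `explain_errors` and `show_job_errors` are hypothetical, and it assumes the error lines start with the text "error reason" as in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the "error reason" lines from
# `qstat -explain E -j <job-id>` output read on stdin.
explain_errors() {
    grep '^error reason'
}

# Hypothetical wrapper: ask the scheduler about one job and filter the result.
show_job_errors() {
    qstat -explain E -j "$1" | explain_errors
}

# Run only when a job ID is given and qstat is available,
# e.g.: ./job_errors.sh 2266220
if [ "$#" -gt 0 ] && command -v qstat >/dev/null 2>&1; then
    show_job_errors "$1"
fi
```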

harveymannering commented 1 year ago

I'm surprised qsub doesn't create the runLog folder itself. I suppose you could just run mkdir runLog yourself in the appropriate directory. It doesn't have any subdirectories, so that should stop the error.
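
The mkdir suggestion can be folded into a small submission wrapper so the directory is always in place before qsub runs. This is a sketch under assumptions: the function names `ensure_log_dir` and `submit_job` are hypothetical, and it assumes the job script writes its output to ~/runLog/, as the stdout_path_list in the qstat output above suggests.

```shell
#!/usr/bin/env bash
# Sketch: make sure the log directory exists before handing the job to qsub.

# Create the directory (and any parents) if needed;
# succeed only if it ends up existing and writable.
ensure_log_dir() {
    local dir="$1"
    mkdir -p "$dir" && [ -d "$dir" ] && [ -w "$dir" ]
}

# Submit a job script only after its log directory is in place.
# Assumption: the job script sends its output to the directory given as $2
# (default: ~/runLog, taken from the stdout_path_list above).
submit_job() {
    local script="$1"
    local log_dir="${2:-$HOME/runLog}"
    if ! ensure_log_dir "$log_dir"; then
        echo "cannot create or write to $log_dir" >&2
        return 1
    fi
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not found; $log_dir is ready but nothing was submitted" >&2
        return 0
    fi
    qsub "$script"
}

# Run only when a job script is given, e.g.: ./submit.sh finetuning.qsub.sh
if [ "$#" -gt 0 ]; then
    submit_job "$@"
fi
```

Using `mkdir -p` keeps the wrapper safe to rerun, since it does nothing when the directory already exists.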