@minxu74 can you please provide detailed instructions to reproduce the issue? In particular, the output of ./preview_run for each of the two cases might be very revealing. Also indicate which cime tag you are using. Thanks
@jedwards4b Thanks a lot for your quick response.
E3SM: fc020071ceb7e5f37ee2d39f3d2f18084fdf8148
CIME: 0cdd4b1c5c5eb2e29c6ec64667724af434847bcf
The simplified steps to reproduce the problem are as follows:
1. ./create_newcase --case test_case --mach pm-cpu --compset I1850CNPRDCTCBC --res hcru_hcru --mpilib mpich --walltime 24:00:00 --handle-preexisting-dirs u --project xxxx --compiler intel
2. ./xmlchange STOP_N=20
3. ./xmlchange REST_N=20
4. ./case.setup -r
5. ./case.build
6. ./case.submit (JOB A with JOBID)
7. ./xmlchange CONTINUE_RUN=TRUE
8. (JOB B; see the sketch after this list)
- ./case.submit --prereq JOBID (passed)
- ./case.submit --batch-args="--dependency=afterok:JOBID" (failed)
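For reference, a condensed shell sketch of steps 6-8 (illustrative only; JOBID stands for the job id that case.submit reports for job A):

# Job A (step 6): initial run; note the job id case.submit reports
./case.submit

# Job B (steps 7-8): same case, now continuing from job A's restarts
./xmlchange CONTINUE_RUN=TRUE
./case.submit --prereq "$JOBID"                              # passed
# ./case.submit --batch-args="--dependency=afterok:$JOBID"   # failed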
Both submissions produce the same preview_run output, as follows:
CASE INFO:
  nodes: 12
  total tasks: 1536
  tasks per node: 128
  thread count: 1
  ngpus per node: 0

BATCH INFO:
  FOR JOB: case.run
    ENV:
      Setting Environment ADIOS2_ROOT=/global/cfs/cdirs/e3sm/3rdparty/adios2/2.9.1/cray-mpich-8.1.25/intel-2023.1.0
      Setting Environment BLA_VENDOR=Intel10_64_dyn
      Setting Environment FI_CXI_RX_MATCH_MODE=software
      Setting Environment GATOR_INITIAL_MB=4000MB
      Setting Environment HDF5_USE_FILE_LOCKING=FALSE
      Setting Environment MOAB_ROOT=/global/cfs/cdirs/e3sm/software/moab/intel
      Setting Environment MPICH_COLL_SYNC=MPI_Bcast
      Setting Environment MPICH_ENV_DISPLAY=1
      Setting Environment MPICH_MPIIO_DVS_MAXNODES=1
      Setting Environment MPICH_VERSION_DISPLAY=1
      Setting Environment NETCDF_PATH=/opt/cray/pe/netcdf-hdf5parallel/4.9.0.9/intel/2023.2
      Setting Environment OMP_NUM_THREADS=1
      Setting Environment OMP_PLACES=threads
      Setting Environment OMP_PROC_BIND=spread
      Setting Environment OMP_STACKSIZE=128M
      Setting Environment PERL5LIB=/global/cfs/cdirs/e3sm/perl/lib/perl5-only-switch
      Setting Environment PNETCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.9/intel/2023.2
    SUBMIT CMD:
      sbatch --time 48:00:00 -q regular --account xxxx test_case/.case.run --resubmit

MPIRUN (job=case.run):
  srun --label -n 1536 -N 12 -c 2 --cpu_bind=cores -m plane=128 test_case/bld/e3sm.exe >> e3sm.log.$LID 2>&1
I think that I misread the issue - can you please clarify? My understanding is that using the --prereq flag works as expected, but using --batch-args does not. Is that correct?
Yes.
Using cime6.1.29 (cesm3) I have checked that both these methods create the same sbatch command.
sbatch --time 02:00:00 -q debug --account mp9 --dependency=afterok:99999 /global/u1/j/jedwards/cesm3/cime/scripts/caseB/.case.run --resubmit
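That is, under that tag both of these invocations should reduce to the sbatch line above (shown with the same placeholder job id 99999):

./case.submit --prereq 99999
./case.submit --batch-args="--dependency=afterok:99999"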
I also checked cime hash 0cdd4b1c5c5eb2e29c6ec64667724af434847bcf and both methods appear to work the same. Can you capture and post the output of the case.submit command for each of these methods? I'm interested in the sbatch command generated for each method. I think that perhaps what is wrong is that you are specifying the case.run JobID and what you need to specify is the case.st_archive jobid. If you specify the case.run jobid you create a race condition in which the second case may start before the case.st_archive from the first run is completed, which will produce the error you are reporting.
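As a minimal sketch of that suggestion (assuming short-term archiving is enabled and that case.submit prints each submitted job's id; the parsing below is hypothetical and the exact message format may differ by CIME version):

# Submit job A and keep the output so the st_archive job id can be extracted
submit_log=$(./case.submit 2>&1); echo "$submit_log"

# Hypothetical extraction of the case.st_archive job id (not case.run's),
# so job B cannot race the archiver
ARCHIVE_ID=$(printf '%s\n' "$submit_log" | sed -n 's/.*case\.st_archive.*id \([0-9][0-9]*\).*/\1/p')

# Make job B wait on the archive step rather than the run step
./case.submit --prereq "$ARCHIVE_ID"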
The short-term archive was turned off in both jobs, and I expected both jobs to generate the same sbatch command. The error occurs because CIME checks for staged restart files when CONTINUE_RUN=.TRUE. and --batch-args is used, but it skips that check when --prereq is used.
Thank you - that helps a lot. Using --prereq has the side effect of skipping the CONTINUE_RUN check. I'm not sure how you could possibly implement this same side effect using the batch-args method. Would simply clarifying the documentation of this feature be an acceptable solution?
Thanks. Yes. It would be helpful to clarify in the documentation the difference between --prereq and --batch-args with regard to the side effect of skipping the restart-file check.
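For later readers, one way to confirm that a dependency actually got attached to job B, whichever submission method is used (standard Slurm commands; assumes a Slurm system such as pm-cpu):

# Inspect job B's dependency field; JOBID_B is the id reported at submit time
scontrol show job "$JOBID_B" | grep -o 'Dependency=[^ ]*'
# Expected something like: Dependency=afterok:<job A's id>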
Two jobs A and B, where B has CONTINUE_RUN=.TRUE., depends on A, and A generates the restart files for B.

With --batch-args="--dependency=afterok:JOBID", job B failed with the error "ERROR: CONTINUE_RUN is true but this case does not appear to have restart files staged in ".

With --prereq JOBID, job B was submitted successfully.