E3SM-Project / E3SM


'bundled' E3SM jobs within a single job script can have non-unique LIDs #4336

Open worleyph opened 3 years ago

worleyph commented 3 years ago

@wagmanbe has a script that he uses in his autotuning work that looks like:

 #SBATCH --job-name=251-300
 #SBATCH --nodes=100
 #SBATCH --output=mybundle.o%j
 #SBATCH --exclusive
 #SBATCH --time=150

 # Number of nodes required by each bundled job
 export SLURM_NNODES=2

 cd /lcrc/group/e3sm/ac.wagman/scratch/dakota/5d_chr_500/workdir.251/cloned.E3SM.ne4pg2_ne4pg2
 ./case.submit --no-batch >LOG 2>&1 & 
 ...
 cd /lcrc/group/e3sm/ac.wagman/scratch/dakota/5d_chr_500/workdir.300/cloned.E3SM.ne4pg2_ne4pg2
 ./case.submit --no-batch >LOG 2>&1 & 
 wait

This submits 50 two-node jobs within a 100-node allocation. In the performance archive, only two jobs were captured, and the other 48 have warning messages like:

 /lcrc/group/e3sm/PERF_Chrysalis/performance_archive/ac.wagman/cloned.E3SM.ne4pg2_ne4pg2/32954.210409-125443 already exists. Skipping archive of timing data and associated provenance.

So, unique LIDs are not being generated: the timestamp part of the LID does not have high enough resolution to disambiguate these logically simultaneous job submissions. The above jobs could perhaps be "fixed" by removing the backgrounding of the case.submit calls, but it would be worthwhile to come up with an alternative that works with this workflow. We could also add more resolution to the timestamp to make collisions less likely (I am not sure what is available at the Python level). Since a single process still executes each case.submit, there should be high enough resolution available to eliminate nonunique LIDs?
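On what is available at the Python level: datetime.strftime supports a %f microseconds field that plain time.strftime lacks. A minimal sketch of a higher-resolution LID, purely as an illustration and not CIME's implementation:

import datetime
import os

def new_lid_microseconds():
    # %f appends microseconds, so even back-to-back calls from the same
    # process produce distinct LIDs.
    lid = datetime.datetime.now().strftime("%y%m%d-%H%M%S-%f")
    os.environ["LID"] = lid
    return lid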

Another idea would be to test whether a given LID has already been used (i.e., whether the corresponding subdirectory already exists). If so, wait a second, generate a new LID, and try again. This would space out the case.submits. There might also be the possibility of race conditions, except that a single process is still executing the job script, so it should be okay?
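A hypothetical sketch of that check-and-retry idea (the archive_root argument and the function name are assumptions for illustration, not CIME code):

import os
import time

def next_unused_lid(archive_root):
    # Regenerate the LID until no archive subdirectory with that name
    # already exists; the sleep spaces out logically simultaneous
    # submissions within the same one-second window.
    while True:
        lid = time.strftime("%y%m%d-%H%M%S")
        if not os.path.isdir(os.path.join(archive_root, lid)):
            return lid
        time.sleep(1)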

I don't know how easy it would be to generate a new LID at the location where the current test is and have it be available every place that it is needed. Question for @jgfouca .

jgfouca commented 3 years ago

@worleyph , the creation of a new LID is well-encapsulated by CIME. This is the current impl:

import os
import time

def new_lid():
    # The LID is a timestamp with one-second resolution, optionally
    # prefixed by the batch job id when running under a batch system.
    lid = time.strftime("%y%m%d-%H%M%S")
    jobid = batch_jobid()  # helper defined alongside new_lid in CIME
    if jobid is not None:
        lid = jobid + "." + lid
    os.environ["LID"] = lid
    return lid

We could add something here, maybe the process id, to ensure it's unique.
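For instance, a rough sketch of appending the process id (the shape of the change, not an actual CIME patch):

import os
import time

def new_lid():
    # Appending the process id disambiguates case.submits launched
    # within the same one-second timestamp window.
    lid = time.strftime("%y%m%d-%H%M%S")
    lid = "{}.{}".format(lid, os.getpid())
    os.environ["LID"] = lid
    return lid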

worleyph commented 3 years ago

@sarats , does PACE use the particulars of the LID for anything, or does it just represent a 'black box' label for each job? That is, would changing it cause you any problems?

worleyph commented 3 years ago

@jgfouca , does backgrounding the ./case.submit call generate a unique process id for that particular call?

jgfouca commented 3 years ago

@worleyph , yes. Since these case.submits are being invoked as commands (and not as Python library calls), the script must fork a new subprocess to execute each of them.

worleyph commented 3 years ago

@jgfouca , this will only be needed for case.submit calls with the --no-batch flag. Would it be possible to make the flag visible to new_lid (either as a global or passed into new_lid as an argument), and only in this case append the process id to the LID?

@sarats, I can modify the archiving script to recognize LIDs both with and without a process id. Will PACE have any problems with this?

jgfouca commented 3 years ago

@worleyph , as a potential workaround, since the LIDs have a seconds field, you could put sleeps in between the case.submit calls. A sleep of 2-3 seconds should guarantee unique LIDs. @jasonb5 , will you look into adding the process id to LIDs?

sarats commented 3 years ago

This specific use-case is still evolving and the final implementation could take a different form. For instance, backgrounding a large number of processes didn't work at LCFs previously.

The first question is to identify whether there is value in capturing provenance and performance for this style of experiment, and at what granularity. As it stands, this would generate a large number of small jobs accumulating in the archive (analogous to certain other cases where we turned off performance archiving). At this stage, my preference is to turn off archiving for these runs.

IIRC PACE treats the LIDs in a black-box manner for storing in the database, but I would need to double-check the parsing logic to make sure a PID/extra hyphen wouldn't end up breaking anything.

worleyph commented 3 years ago

@jgfouca, thanks for suggesting adding waits in the job script. Eliminating the backgrounding would allow us to require only a sleep of 1 second. ./case.submit at the terminal level does not 'wait' until the job finishes running. Is backgrounding necessary here (when using --no-batch)?

Appending the process id to the lid still seems like the long-term solution - it is unambiguous and should be fool-proof. I just don't want to do so unless it is needed, i.e., only use it if --no-batch is specified. I'd also like the lid not to have a varying length. Can the number of digits in the process id be fixed?

@sarats, at various times we (including you) have discussed how best to run ensembles. Bundling ensembles into a single job script seems like something we might still want to support. This is orthogonal to the issue of running many small jobs, which just happen to be bundled together.

worleyph commented 3 years ago

@jgfouca , "never mind"

./case.submit at the terminal level does not 'wait' until the job finishes running. Is backgrounding necessary here (when using --no-batch)?

Of course you need to background it. Same as inlining the call to srun.

jgfouca commented 3 years ago

@worleyph ,

I'd also like the lid not to have varying lengths.

I agree, no reason to add code branching if it's not really needed. I think we should just always add the PID and be done with it.

worleyph commented 3 years ago

I agree, no reason to add code branching if it's not really needed. I think we should just always add the PID and be done with it.

At least in the Perl scripts I write, having two distinct forms (one with a process id and one without) requires only minor changes. Since these are tacked onto the end of most of the files copied to the performance archive, it would be annoying to make them even longer in order to support an infrequent use case. I was just talking about making process ids, when used, the same length, e.g. all 5 digits, zero-filled on the left.
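For illustration, zero-fill formatting in Python gives a constant width, assuming pids stay at or below the default Linux pid_max of 32768 (pid_max can be raised, in which case more digits would be needed):

import os

# Left-pad the pid with zeros to a fixed 5 characters, e.g. "04273".
pid_str = "{:05d}".format(os.getpid())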

worleyph commented 3 years ago

And, for backward compatibility with old runs and new runs with older versions of the model, we will need to support LIDs without process ids for the foreseeable future in any case.

sarats commented 3 years ago

Just wish to add that the PACE backend needs to be reviewed before implementing this change. So, please make me a reviewer on any PR.

jgfouca commented 2 years ago

@worleyph , @sarats , I've thought about this some more. I am a bit scared to change the LID format, since this would have wide-ranging impacts across CIME, the host models that use CIME, and an unknown number of homegrown scripts.

Looking at the error "Skipping archive of timing data and associated provenance.", this is coming from E3SM's provenance code, which we own and are free to change. I propose we choose one of the following:

1) Modify E3SM's provenance code to handle an LID collision more gracefully. Open to ideas on how we could know we were dealing with an LID collision and not simply a re-archiving of the same case (see the sketch below).

2) Accept that this uncommon use case will require short sleeps between launching jobs. In other words, we just live with the workaround I gave you as the long-term solution.
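As a sketch of what option 1 might look like, one graceful fallback is to suffix the archive directory on collision instead of skipping it; the function name and suffixing scheme here are assumptions for illustration, not the actual provenance code:

import os

def unique_archive_dir(lid_dir):
    # If the LID-named directory already exists, treat it as a collision
    # and append an increasing numeric suffix rather than skipping the
    # archive step.
    candidate = lid_dir
    n = 1
    while os.path.exists(candidate):
        candidate = "{}-{}".format(lid_dir, n)
        n += 1
    os.makedirs(candidate)
    return candidate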

ndkeen commented 2 years ago

I wanted to point out that at least with slurm... which may be the only tool we are trying to bundle with, we may be able to use the job step number as a unique identifier. For example, here is one bundled job:

cori04% sacct -j 52362182 -a -o JobID%-20,JobName,State,Elapsed,NNodes,Start
               JobID    JobName      State    Elapsed   NNodes               Start 
-------------------- ---------- ---------- ---------- -------- ------------------- 
52362182             bundle-gb+  COMPLETED   07:51:46     1024 2021-12-28T07:35:54 
52362182.batch            batch  COMPLETED   07:51:46        1 2021-12-28T07:35:54 
52362182.extern          extern  COMPLETED   07:51:49     1024 2021-12-28T07:35:54 
52362182.0             e3sm.exe  COMPLETED   01:24:48      128 2021-12-28T07:39:15 
52362182.1             e3sm.exe  COMPLETED   01:25:09      128 2021-12-28T07:39:15 
52362182.2             e3sm.exe  COMPLETED   01:25:06      128 2021-12-28T07:39:15 
52362182.3             e3sm.exe  COMPLETED   01:24:14      128 2021-12-28T07:39:18 
52362182.4             e3sm.exe  COMPLETED   01:25:01      128 2021-12-28T07:39:20 
52362182.5             e3sm.exe  COMPLETED   01:25:01      128 2021-12-28T07:39:20 
52362182.6             e3sm.exe  COMPLETED   01:25:18      128 2021-12-28T07:39:33 
52362182.7             e3sm.exe  COMPLETED   01:24:48      128 2021-12-28T07:39:33 
52362182.8             e3sm.exe  COMPLETED   02:06:55      128 2021-12-28T09:03:32 
52362182.9             e3sm.exe  COMPLETED   01:26:53      128 2021-12-28T09:04:03 
52362182.10            e3sm.exe  COMPLETED   01:58:29      128 2021-12-28T09:04:26 
52362182.11            e3sm.exe  COMPLETED   02:13:37      128 2021-12-28T09:04:26 
52362182.12            e3sm.exe  COMPLETED   01:27:27      128 2021-12-28T09:04:26 
52362182.13            e3sm.exe  COMPLETED   01:27:00      128 2021-12-28T09:04:26 
52362182.14            e3sm.exe  COMPLETED   02:07:07      128 2021-12-28T09:04:26 
52362182.15            e3sm.exe  COMPLETED   01:23:52      128 2021-12-28T09:04:51 
52362182.16            e3sm.exe  COMPLETED   01:23:44      128 2021-12-28T10:28:43 
52362182.17            e3sm.exe  COMPLETED   01:23:40      128 2021-12-28T10:30:56 
52362182.18            e3sm.exe  COMPLETED   01:23:15      128 2021-12-28T10:31:26 
52362182.19            e3sm.exe  COMPLETED   01:23:31      128 2021-12-28T10:31:53 
52362182.20            e3sm.exe  COMPLETED   01:23:40      128 2021-12-28T11:02:55 
52362182.21            e3sm.exe  COMPLETED   01:22:46      128 2021-12-28T11:10:27 
52362182.22            e3sm.exe  COMPLETED   01:22:36      128 2021-12-28T11:11:33 
52362182.23            e3sm.exe  COMPLETED   01:24:12      128 2021-12-28T11:18:03 
52362182.24            e3sm.exe  COMPLETED   01:23:07      128 2021-12-28T11:52:27 
52362182.25            e3sm.exe  COMPLETED   01:23:02      128 2021-12-28T11:54:36 
52362182.26            e3sm.exe  COMPLETED   01:23:49      128 2021-12-28T11:54:41 
52362182.27            e3sm.exe  COMPLETED   01:23:31      128 2021-12-28T11:55:24 
52362182.28            e3sm.exe  COMPLETED   01:23:15      128 2021-12-28T12:26:35 
52362182.29            e3sm.exe  COMPLETED   01:23:04      128 2021-12-28T12:33:13 
52362182.30            e3sm.exe  COMPLETED   01:23:02      128 2021-12-28T12:34:09 
52362182.31            e3sm.exe  COMPLETED   01:22:23      128 2021-12-28T12:42:15 
52362182.32            e3sm.exe  COMPLETED   01:23:15      128 2021-12-28T13:15:34 
52362182.33            e3sm.exe  COMPLETED   01:23:11      128 2021-12-28T13:17:38 
52362182.34            e3sm.exe  COMPLETED   01:23:07      128 2021-12-28T13:18:30 
52362182.35            e3sm.exe  COMPLETED   01:23:30      128 2021-12-28T13:18:55 
52362182.36            e3sm.exe  COMPLETED   01:23:04      128 2021-12-28T13:49:52 
52362182.37            e3sm.exe  COMPLETED   01:22:21      128 2021-12-28T13:56:17 
52362182.38            e3sm.exe  COMPLETED   01:23:13      128 2021-12-28T13:57:11
52362182.39            e3sm.exe  COMPLETED   01:22:49      128 2021-12-28T14:04:39 

And then I can use jobid.step to inquire about specific executions (here each of these is a different case):

cori04% sacct -j 52362182.1 -a -o JobID%-20,JobName,State,Elapsed,NNodes,Start
               JobID    JobName      State    Elapsed   NNodes               Start 
-------------------- ---------- ---------- ---------- -------- ------------------- 
52362182.1             e3sm.exe  COMPLETED   01:25:09      128 2021-12-28T07:39:15 

cori04% sacct -j 52362182.39 -a -o JobID%-20,JobName,State,Elapsed,NNodes,Start
               JobID    JobName      State    Elapsed   NNodes               Start 
-------------------- ---------- ---------- ---------- -------- ------------------- 
52362182.39            e3sm.exe  COMPLETED   01:22:49      128 2021-12-28T14:04:39 

And here is an example of a normal (non-bundled) job:

cori04% sacct -j 52816651 -a -o JobID%-20,JobName,State,Elapsed,NNodes,Start
               JobID    JobName      State    Elapsed   NNodes               Start 
-------------------- ---------- ---------- ---------- -------- ------------------- 
52816651             run.SCREA+  COMPLETED   03:04:43      192 2022-01-10T18:30:37 
52816651.batch            batch  COMPLETED   03:04:43        1 2022-01-10T18:30:37 
52816651.extern          extern  COMPLETED   03:04:56      192 2022-01-10T18:30:37 
52816651.0             e3sm.exe  COMPLETED   03:03:22      192 2022-01-10T18:31:42 

cori04% sacct -j 52816651.1 -a -o JobID%-20,JobName,State,Elapsed,NNodes,Start
               JobID    JobName      State    Elapsed   NNodes               Start 
-------------------- ---------- ---------- ---------- -------- -------------------
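If the job-step number were adopted as the identifier, the steps of a bundled job could also be enumerated programmatically; a sketch using the same sacct fields as above (the e3sm.exe job-name filter is an assumption about the naming convention):

import subprocess

def list_model_steps(jobid):
    # -n drops the header and -P delimits fields with "|"; each
    # jobid.step row whose job name is e3sm.exe is one bundled
    # model execution.
    out = subprocess.check_output(
        ["sacct", "-j", str(jobid), "-a", "-n", "-P",
         "-o", "JobID,JobName,State,Elapsed,NNodes,Start"],
        universal_newlines=True)
    steps = []
    for line in out.splitlines():
        fields = line.split("|")
        if "." in fields[0] and fields[1] == "e3sm.exe":
            steps.append(fields)
    return steps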