ESMCI / cime

Common Infrastructure for Modeling the Earth
http://esmci.github.io/cime

env_batch.py does not pick up queue-specific directives and only uses ones from default queue #4564

mvdebolskiy closed 9 months ago

mvdebolskiy commented 10 months ago

I was trying to port CESM2.1.5 to one of the sigma2 machines (with the Slurm scheduler), which has nodemin>4 for the normal queue. The different queues on these machines are handled by --qos and --partition directives (some partitions have more than one --qos), similar to the archer2 config:

  <batch_system MACH="archer2" type="slurm" >
    <batch_submit>sbatch</batch_submit>
    <submit_args>
      <arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
      <arg flag="-q" name="$JOB_QUEUE"/>
      <arg flag="--account" name="$PROJECT"/>
      <arg flag="--export=ALL"/>
    </submit_args>

    <directives queue="standard">
      <directive>--partition=standard</directive>
      <directive>--qos=standard</directive>
      <directive>--cpus-per-task={{ thread_count }}</directive>
    </directives>

    <directives queue="short">
      <directive>--partition=standard</directive>
      <directive>--qos=short</directive>
      <directive>--cpus-per-task={{ thread_count }}</directive>
    </directives>

    <directives queue="serial">
      <directive>--partition=serial</directive>
      <directive>--qos=serial</directive>
    </directives>

    <!-- following will need updating as system develops -->
    <queues>
      <queue walltimemax="24:00:00" nodemin="1" nodemax="1024" default="true">standard</queue>
      <!-- <queue walltimemin="24:00:00" walltimemax="48:00:00" nodemin="1" nodemax="64" >long</queue> -->
      <queue walltimemax="00:20:00" nodemin="1" nodemax="32">short</queue>
      <queue walltimemax="24:00:00" nodemin="1" nodemax="1">serial</queue>
    </queues>
  </batch_system>

master sets the directives correctly, but maint-5.6 does not, even though it assigns the correct queue in select_best_queue (JOB_QUEUE gets set correctly).

mvdebolskiy commented 10 months ago

Found a fix: https://github.com/ESMCI/cime/blob/a488510bc24b981809b3dda5ee7b9809a7fc9616/scripts/lib/CIME/XML/env_batch.py#L331 Changing self to case on that line, the same way it is done in maint-5.8, gets the proper $JOB_QUEUE value instead of None.
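
For anyone following along, here is a rough sketch of why the queue value matters when directives are picked. It is not the CIME code: `select_directives` and `BATCH_XML` are hypothetical names made up for illustration, the XML is a trimmed copy of the archer2 excerpt above, and the matching rule (a block with a queue attribute applies only when it equals the job queue) is an assumption about the intended behavior.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the archer2-style config quoted in the issue
# (the cpus-per-task template lines are left out for brevity).
BATCH_XML = """
<batch_system MACH="archer2" type="slurm">
  <directives queue="standard">
    <directive>--partition=standard</directive>
    <directive>--qos=standard</directive>
  </directives>
  <directives queue="short">
    <directive>--partition=standard</directive>
    <directive>--qos=short</directive>
  </directives>
  <directives queue="serial">
    <directive>--partition=serial</directive>
    <directive>--qos=serial</directive>
  </directives>
</batch_system>
"""


def select_directives(xml_text, queue):
    """Hypothetical helper: return the directives that apply to `queue`.

    Assumed rule: a <directives> block without a queue attribute applies
    to every queue; a block with one applies only on an exact match.
    """
    root = ET.fromstring(xml_text)
    selected = []
    for block in root.findall("directives"):
        block_queue = block.get("queue")
        if block_queue is not None and block_queue != queue:
            continue  # block belongs to a different queue
        selected.extend(d.text.strip() for d in block.findall("directive"))
    return selected


# With the resolved queue name, the queue-specific block is found.
short_directives = select_directives(BATCH_XML, "short")

# With None (what the lookup reportedly returned before the fix),
# no queue-specific block matches in this sketch.
none_directives = select_directives(BATCH_XML, None)

print(short_directives)
print(none_directives)
```

In this sketch a None queue matches none of the queue-specific blocks, so the result depends entirely on where the JOB_QUEUE value is read from. How maint-5.6 ends up with the default queue's directives specifically is not modeled here.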