OSC / ood_core

Open OnDemand core library
https://osc.github.io/ood_core/
MIT License

PBSPro caching previous cluster #180

Open johrstrom opened 4 years ago

johrstrom commented 4 years ago

We got a bug report on Discourse describing some strange behaviour with PBS.

What happens:

  1. Submit a batch connect job against cluster_x. This works.
  2. Submit a batch connect job against cluster_y. This actually submits to cluster_x, even though the environment variable PBS_DEFAULT is correctly set to cluster_y.
  3. Submit a batch connect job against cluster_z. This actually submits to cluster_y, even though the environment variable PBS_DEFAULT is correctly set to cluster_z.

At this point you can rinse and repeat, and the job is always submitted to the previous cluster. If you wait ~60 seconds, you get back to step 1, where it works correctly.

The fix was to submit the job with -q <queue>@<server> instead of just -q <queue>.

This ticket is to either figure out what's going on with the PBS_DEFAULT environment variable and why it's being cached or simply stop using it and instead force the server in the -q option.

The -q option can take any one of these 3 forms. Currently we only use the first form, and only when queue_name or reservation_id is defined.

<queue>
<queue>@<server>
@<server>
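
As a rough illustration of the three forms above, here is a minimal sketch of how the -q value could be assembled from an optional queue and an optional server. This is not the ood_core implementation (which is Ruby); the function names `queue_destination` and `qsub_queue_args` and their parameters are hypothetical and exist only for this example.

```python
from typing import List, Optional


def queue_destination(queue: Optional[str] = None,
                      server: Optional[str] = None) -> Optional[str]:
    """Build the value for the -q option from an optional queue and server.

    Hypothetical helper: returns one of the three forms listed above,
    or None when neither a queue nor a server is given.
    """
    if queue and server:
        return f"{queue}@{server}"   # <queue>@<server>
    if server:
        return f"@{server}"          # @<server>
    if queue:
        return queue                 # <queue>
    return None


def qsub_queue_args(queue: Optional[str] = None,
                    server: Optional[str] = None) -> List[str]:
    """Return the argument list fragment for -q, or [] if nothing to pass."""
    destination = queue_destination(queue, server)
    return ["-q", destination] if destination else []


if __name__ == "__main__":
    # Placeholder queue and server names, for illustration only.
    print(qsub_queue_args("workq", "cluster_y"))
    print(qsub_queue_args(server="cluster_y"))
    print(qsub_queue_args("workq"))
```

With this shape, the server is always stated explicitly whenever it is known, so the submission no longer depends on how PBS_DEFAULT is resolved.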


johrstrom commented 4 years ago

I know using PBS_DEFAULT works well for us with Torque/MOAB, but my vote would be to just use the -q option. It's known to work and would save us time investigating.

treydock commented 4 years ago

I think using command flags makes sense and is more direct. It would also be the same approach we take for submitting to different clusters in Slurm.

johrstrom commented 4 years ago

This seems to be an issue outside our control. It was shown that we set the environment variable correctly, so the only alternative for us is to use the full -q option, which is more explicit (the name PBS_DEFAULT suggests the value may only be a fallback or an implied default).