NCAR / ncar-jobqueue

Utilities for configuring dask-jobqueue with appropriate settings for NCAR clusters
https://jobqueue.dask.org/
Apache License 2.0

avoid using fqdn? #58

Open dcherian opened 3 years ago

dcherian commented 3 years ago

I landed on a Casper node that was named crthc02.hpc.ucar.edu instead of crhtc02.hpc.ucar.edu, which broke ncar_jobqueue's regex.

I emailed cislhelp and they fixed it but also suggested not using the FQDN...

> I'd also suggest that you avoid if you can, using the FQDN as an identifier for whatever purpose you're using it for.

Perhaps we should talk to them and figure out a better solution.
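To illustrate the failure mode described above, here is a minimal sketch of hostname-based detection. The pattern and the function name `looks_like_casper` are hypothetical and are not the regex ncar_jobqueue actually uses; they only show how a strict FQDN match fails when a node is misnamed.

```python
import re

# Hypothetical pattern for illustration only; NOT the actual ncar_jobqueue regex.
# It assumes Casper nodes are named "crhtc" followed by digits.
CASPER_PATTERN = re.compile(r"^crhtc\d+\.hpc\.ucar\.edu$")


def looks_like_casper(fqdn: str) -> bool:
    """Return True if the FQDN matches the assumed Casper naming scheme."""
    return CASPER_PATTERN.match(fqdn) is not None


# The correctly named node matches; the misnamed node from this issue does not.
print(looks_like_casper("crhtc02.hpc.ucar.edu"))
print(looks_like_casper("crthc02.hpc.ucar.edu"))
```

Any detection scheme keyed on the exact hostname shares this weakness: a single transposed letter in the node name is enough to make the cluster unrecognized.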

kmpaul commented 3 years ago

Did they suggest an alternative solution? I don't know of any other mechanism to determine if your node is in the Casper cluster or the Cheyenne cluster.

dcherian commented 3 years ago

I didn't ask them. I thought it would be better for xdev to open up a new conversation rather than extending the scope of that ticket.

andersy005 commented 3 years ago

> Did they suggest an alternative solution? I don't know of any other mechanism to determine if your node is in the Casper cluster or the Cheyenne cluster.

Ccing @jbaksta

jbaksta commented 3 years ago

Why not just explicitly state which resource you're targeting as part of a job submission process? Is there a reason to tie you to a piece of hardware, so to speak, rather than just set an environment variable that says you submitted to Casper or Cheyenne? Basically, why inspect when you can be explicit on a submission? Hostnames are likely to be much more fluid, especially as we look at higher levels of enablement w/ Linux namespaces.

An alternative could be to inspect the $PBS_JOBID. Usually the CSG modules loaded set a specific environment variable too, because they use something like that for $PATH building since we have shared application storage. At least with default modules, you'll have one of the following two set, on Cheyenne and on Casper respectively:

NCAR_HOST=cheyenne

NCAR_HOST=dav

Note that with cross submission between clusters (a new-ish PBS capability we're enabling), the environment may get reset during job submission, but loading the ncarenv module gives you the above.
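The NCAR_HOST suggestion above could be sketched roughly as follows. This is an assumption-laden illustration, not ncar_jobqueue's implementation: the function name `identify_cluster` and the mapping are hypothetical, and treating `dav` as Casper is a reading of this comment rather than something confirmed elsewhere in the thread.

```python
import os

# Assumed mapping from the NCAR_HOST values quoted above to cluster names.
# "dav" -> "casper" is an interpretation of this thread, not a documented fact.
NCAR_HOST_TO_CLUSTER = {
    "cheyenne": "cheyenne",
    "dav": "casper",
}


def identify_cluster(environ=None):
    """Return the cluster name based on NCAR_HOST, or None if it cannot be determined.

    `environ` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if environ is None else environ
    host = env.get("NCAR_HOST", "").strip().lower()
    return NCAR_HOST_TO_CLUSTER.get(host)


# Example with explicit environments instead of the real process environment.
print(identify_cluster({"NCAR_HOST": "dav"}))
print(identify_cluster({}))
```

Returning None when the variable is missing or unrecognized leaves room for a caller to fall back to another signal, which matters given the note that the environment may be reset during cross-cluster submission unless the ncarenv module is loaded.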