OSC / ood-activejobs

[MOVED] Active Jobs provides details of scheduled jobs on an HPC cluster.
https://osc.github.io/Open-OnDemand/
MIT License

Active Jobs page doesn't list any job #157

Closed kcgthb closed 6 years ago

kcgthb commented 6 years ago

Hi there!

I'm pretty sure it's a configuration issue on my end, but I'm just discovering OoD, I'm a bit overwhelmed by all the moving parts and I don't really know where to look.

I followed the installation instructions and installed the RPM (ondemand-1.3.5-2.el7.x86_64), I have a cluster file configured:

v2:
  metadata:
    title: "Cluster"
  job:
    adapter: "slurm"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"

and the dashboard, file explorer and shell apps are working beautifully. But the Active Jobs app doesn't, and the job list stays desperately empty.

I can successfully run Slurm commands as a user on the OnDemand node (we're running Slurm 17.11.5), but it looks like the activejobs app cannot.

Do you have any suggestions on things I could check to understand where the problem is coming from?

Thanks!

brianmcmichael commented 6 years ago

Hi @kcgthb,

I notice that the docs (https://osc.github.io/ood-documentation/master/installation/add-cluster-config.html#slurm) mention that you can remove the cluster name from the configuration file if you are not running a multi-cluster setup. However, I believe the Active Jobs app uses this value to determine the host name to display. It's yet to be determined whether this is a bug in our documentation or in the Active Jobs app, but in the meantime, would you mind adding the cluster value to that YAML file and letting us know if that fixes the app?

v2:
  metadata:
    title: "Cluster"
  job:
    adapter: "slurm"
    cluster: "cluster"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"
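If you're unsure what value to use there, I believe cluster: should match the ClusterName that Slurm itself is configured with. A quick way to check it (the conf written below is a made-up stand-in for illustration; on a real node, grep your actual /etc/slurm/slurm.conf instead):

```shell
# Hypothetical illustration: cluster: should match Slurm's ClusterName.
# The conf created here is a stand-in; on a real node, grep
# /etc/slurm/slurm.conf directly.
conf=$(mktemp)
printf 'ClusterName=cluster\nSlurmctldHost=head01\n' > "$conf"
grep -i '^ClusterName' "$conf"   # -> ClusterName=cluster
```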

Thanks!

nickjer commented 6 years ago

@kcgthb

Hah, you found the RPM. That means you are probably reading the documentation on the develop branch (yet to be released):

https://osc.github.io/ood-documentation/develop/installation/resource-manager/slurm.html

Your example cluster config looks good, but I do want to confirm that your bin: field is pointing to the path where you have the Slurm client binaries installed (e.g., is it /usr/bin/sbatch?).
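A quick shell loop to sanity-check that path (the /usr/bin path matches your config; the list of commands is my assumption about which Slurm clients the adapter shells out to):

```shell
# Check that the Slurm client binaries exist under the configured bin:
# path. /usr/bin matches the reporter's config; the command list is an
# assumption about what the adapter needs.
slurm_bin="/usr/bin"
for cmd in squeue sbatch scancel; do
  if [ -x "$slurm_bin/$cmd" ]; then
    echo "ok: $cmd"
  else
    echo "missing: $slurm_bin/$cmd"
  fi
done
```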

Also, if you are working in a multi-cluster environment you may need to specify the cluster: field; otherwise Active Jobs should work without it.

An example of a cluster config from one of our partners that is successfully using Slurm:

v2:
  metadata:
    title: "Cluster"
  login:
    host: "cluster.university.edu"
  job:
    adapter: "slurm"
    bin: "/opt/packages/slurm/default/bin"

nickjer commented 6 years ago

As a side question, do you see the cluster name "Cluster" in the top-right dropdown menu "All Clusters"?

kcgthb commented 6 years ago

@brianmcmichael @nickjer Thanks! I appreciate the feedback and suggestion.

I did indeed follow the instructions from the develop branch, because the RPM installation was so appealing to me. :)

I tried with and without the "cluster:" line in the /etc/ood/config/clusters.d/mycluster.yaml configuration file, restarted httpd and touched /var/www/ood/apps/sys/dashboard/tmp/restart.txt between each try, but I still can't see any jobs listed.

My Slurm utilities are indeed in /usr/bin (we install Slurm as an RPM):

$ which squeue
/usr/bin/squeue

And I don't see my cluster name in the top-right dropdown menu "All Clusters". This is what it looks like: [screenshot]

kcgthb commented 6 years ago

Aaargh, and then I just realized that I named my cluster config file mycluster.yaml, instead of mycluster.yml...

Using the proper extension made things work, all of a sudden. :smile:

Sorry for the noise, then, looks like things are working great now.
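
For anyone hitting the same thing later, the pitfall is easy to reproduce in a scratch directory (that only *.yml is globbed is inferred from the behavior I saw, not confirmed against the dashboard source):

```shell
# Reproduce the extension pitfall: if only *.yml is globbed, a .yaml
# config is silently ignored. A scratch directory stands in for
# /etc/ood/config/clusters.d.
dir=$(mktemp -d)
touch "$dir/mycluster.yaml" "$dir/goodcluster.yml"
ls "$dir"/*.yml    # only goodcluster.yml appears
ls "$dir"/*.yaml   # the silently ignored file
```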

nickjer commented 6 years ago

Actually, thanks for pointing that out. It didn't occur to me to check for both spellings of the extension. I will open an issue in the relevant repo about that.