OSC / ondemand

Supercomputing. Seamlessly. Open, Interactive HPC Via the Web
https://openondemand.org/

Duplicate jobs in 'active jobs'. #3668

gcpmendez opened this issue 1 month ago

gcpmendez commented 1 month ago

We noticed that in the Jobs -> Active Jobs tab every job shows up twice, once per cluster, because both cluster definitions share the same Slurm configuration and Slurm itself is configured with a single cluster:

$ _cpu1r
$ sacctmgr show cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
     teide      10.0.22.24         6817 10240         1                                                                                           normal    

and in OnDemand we can verify the cluster configurations:

$ ssh root@ondemand.hpc.iter.es
$ cd /etc/ood/config/clusters.d
$ cat anaga.yml
---
v2:
  metadata:
    title: "Anaga"
  login:
    host: "10.5.22.101"
  job:
    adapter: "slurm"
    cluster: "teide"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"
    #bin_overrides:
      # sbatch: "/usr/local/bin/sbatch"
      # squeue: "/usr/bin/squeue"
      # scontrol: "/usr/bin/scontrol"
      # scancel: ""
    copy_environment: false
    partitions: ["gpu"]
  batch_connect:
    basic:
      script_wrapper: |
        ml purge
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
    vnc:
      script_wrapper: |
        ml purge
        ml load TurboVNC
        #export PATH="/usr/local/turbovnc/bin:$PATH"
        #export WEBSOCKIFY_CMD="/usr/local/websockify/run"
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
$ cat teide.yml
---
v2:
  metadata:
    title: "Teide"
  login:
    host: "10.5.22.100"
  job:
    adapter: "slurm"
    cluster: "teide"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"
    #bin_overrides:
      # sbatch: "/usr/local/bin/sbatch"
      # squeue: "/usr/bin/squeue"
      # scontrol: "/usr/bin/scontrol"
      # scancel: ""
    copy_environment: false
  batch_connect:
    basic:
      script_wrapper: |
        ml purge
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
    vnc:
      script_wrapper: |
        ml purge
        ml load TurboVNC
        #export PATH="/usr/local/turbovnc/bin:$PATH"
        #export WEBSOCKIFY_CMD="/usr/local/websockify/run"
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
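
Since both definitions point at the same Slurm controller (the same cluster: "teide" and the same slurm.conf), each cluster tab ends up listing the same jobs. Just to illustrate the situation (this is not necessarily the exact call OnDemand makes), listing the queue from a login node with the partition column shows the single job set both tabs draw from, and the partition is the only field that could tell the two "virtual" clusters apart:

$ squeue --format="%i %P %u %T %j"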

According to the following thread, https://discourse.openondemand.org/t/configure-partitions-as-clusters/701/2, we could try to create an initializer to filter the jobs.

Ideally we would filter the jobs by partition: assign jobs in the "gpu" partition to the "anaga" cluster and all remaining jobs to the "teide" cluster.

$ _cpu1r
$ scontrol show partition | grep PartitionName
PartitionName=main
PartitionName=batch
PartitionName=express
PartitionName=long
PartitionName=gpu
PartitionName=fatnodes
PartitionName=ondemand
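
Besides the initializer idea, another possible workaround, sketched here only and not tested, would be to use the bin_overrides section that is already commented out in our cluster files and point squeue at a small wrapper that restricts the listing to the gpu partition for the "anaga" cluster. The path /usr/local/bin/squeue_gpu below is just a hypothetical name:

#!/bin/bash
# Hypothetical squeue wrapper for the "anaga" cluster definition: forward
# whatever arguments OnDemand passes, but limit the output to the gpu
# partition so Active Jobs under Anaga only shows GPU jobs.
exec /usr/bin/squeue --partition=gpu "$@"

It would be referenced from anaga.yml with bin_overrides: squeue: "/usr/local/bin/squeue_gpu", and teide.yml could use the inverse wrapper with --partition=main,batch,express,long,fatnodes,ondemand (squeue accepts a comma-separated partition list). We have not checked whether this interferes with the other places OnDemand calls squeue, for example status checks of interactive sessions.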

Any help is welcome so that we can correctly view the jobs associated with each virtual cluster while keeping a single cluster configured in Slurm. Thanks in advance.

johrstrom commented 1 month ago

I'm not sure what the solution is here. Sure, you can define an initializer to filter based on the cluster if you don't already have one, but if you've defined 2 clusters, OnDemand will act as if they're actually two clusters.

I'm not aware of the virtual cluster pattern here where teide is actually just a partition on anaga (or vice versa), but I guess I'd ask if you actually need the two separate cluster definitions.