dask / dask-gateway

A multi-tenant server for securely deploying and managing Dask clusters.
https://gateway.dask.org/

Dask cluster stuck in pending status and shuts itself down with Dask Gateway on a Slurm HPC cluster #478


menendes commented 2 years ago

What happened: When I try to create a cluster via dask-gateway I get an error like the one below. Even when the cluster is created successfully, I think it gets stuck in the pending state and then shuts itself down automatically. When I submit a job directly with a Slurm command such as sbatch, I can verify that the job runs successfully on the Slurm cluster, but when I try to launch the job via dask-gateway it closes itself after a few seconds.

from dask_gateway import Gateway
from dask_gateway import BasicAuth

auth = BasicAuth(username="dask", password="password")

gateway = Gateway("http://10.100.3.99:8000", auth=auth)

print(gateway.list_clusters())

cluster = gateway.new_cluster()
print(gateway.list_clusters())
gateway.close()
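
For completeness, a minimal sketch (assuming the same gateway address and credentials as above, and that dask-gateway's cluster reports expose name and status fields) that keeps the client alive and polls the cluster list, so the pending-to-stopped transition is visible from the client side rather than only in the Slurm logs:

import time

from dask_gateway import BasicAuth, Gateway

auth = BasicAuth(username="dask", password="password")
gateway = Gateway("http://10.100.3.99:8000", auth=auth)

cluster = gateway.new_cluster()
try:
    for _ in range(10):
        for report in gateway.list_clusters():
            # report.status should move through PENDING / RUNNING / STOPPED / FAILED
            print(report.name, report.status)
        time.sleep(5)
finally:
    gateway.close()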

dask_gateway_config.py

c.DaskGateway.backend_class = (
    "dask_gateway_server.backends.jobqueue.slurm.SlurmBackend"
)

c.DaskGateway.authenticator_class = "dask_gateway_server.auth.SimpleAuthenticator"
c.SimpleAuthenticator.password = "password"
#c.SimpleAuthenticator.username = "dask"
c.DaskGateway.log_level = 'DEBUG'
#c.DaskGateway.show_config = True
c.SlurmClusterConfig.scheduler_cores = 1
c.SlurmClusterConfig.scheduler_memory = '500 M'
c.SlurmClusterConfig.staging_directory = '{home}/.dask-gateway/'
c.SlurmClusterConfig.worker_cores = 1
c.SlurmClusterConfig.worker_memory = '500 M'
c.SlurmBackend.backoff_base_delay = 0.1
c.SlurmBackend.backoff_max_delay = 300
#c.SlurmBackend.check_timeouts_period = 0.0
c.SlurmBackend.cluster_config_class = 'dask_gateway_server.backends.jobqueue.slurm.SlurmClusterConfig'
c.SlurmBackend.cluster_heartbeat_period = 15
c.SlurmBackend.cluster_start_timeout = 60
c.SlurmBackend.cluster_status_period = 30
c.SlurmBackend.dask_gateway_jobqueue_launcher = '/opt/dask-gateway/miniconda/bin/dask-gateway-jobqueue-launcher'

c.SlurmClusterConfig.adaptive_period = 3
c.SlurmClusterConfig.partition = 'computenodes'
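
As a side note on the staging directory option above: a minimal sketch of how the '{home}' template in c.SlurmClusterConfig.staging_directory is expected to expand. The actual expansion and the per-cluster subdirectory are handled by the gateway server; the paths here are only illustrative for the 'dask' user.

# Purely illustrative expansion of the staging_directory template for a user
# whose home directory is /home/dask; the server performs this per cluster.
template = "{home}/.dask-gateway/"
staging_root = template.format(home="/home/dask")
print(staging_root)  # /home/dask/.dask-gateway/
# Each new cluster then gets its own subdirectory, e.g.
# /home/dask/.dask-gateway/<cluster-id>/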

scontrol show job output

[scontrol show job output attached]

Environment:

martindurant commented 2 years ago

Can you get the log from the failed process? As far as I can tell, the printout only says that it terminated with a non-zero code.

menendes commented 2 years ago

> Can you get the log from the failed process? As far as I can tell, the printout only says that it terminated with a non-zero code.

Hi Martin, do you mean the slurmctld or slurmd log? Where exactly can I view the job logs?

martindurant commented 2 years ago

I'm afraid I don't know where such a log would appear, perhaps your sysadmin would know.

menendes commented 2 years ago

When I view the logs on the worker node, I notice some errors. The related log lines are below.

Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: Launching batch job 67 for UID 1001
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherEnergy NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherProfile NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherInterconnect NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherFilesystem NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  switch NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Job accounting gather LINUX plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  cont_id hasn't been set yet not running poll
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  laying out the 1 tasks on 1 hosts testslurmworker1 dist 2
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Message thread started pid = 41666
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: task affinity plugin loaded with CPU mask 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Checkpoint plugin loaded: checkpoint/none
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: Munge credential signature plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  job_container none plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: error: Could not open stdout file /home/dask/.dask-gateway/2428b456f82a44fdb3c8e57576662e8f/dask-scheduler-2428b456f82a44fdb3c8e57576662e8f.log: >
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: error: IO setup failed: No such file or directory
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  step_terminate_monitor_stop signaling condition
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: job 67 completed with slurm_rc = 0, job_rc = 256
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Message thread exited
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: done with job
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  _rpc_terminate_job, uid = 64030
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  task_p_slurmd_release_resources: affinity jobid 67
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  credential for job 67 revoked
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Waiting for job 67's prolog to complete
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Finished wait for job 67's prolog to complete
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Calling /usr/sbin/slurmstepd spank epilog
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  Running spank/epilog for jobid [67] uid [1001]
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  completed epilog for jobid 67
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Job 67: sent epilog complete msg: rc = 0
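
The "Could not open stdout file ... No such file or directory" lines point at the scheduler log path under /home/dask/.dask-gateway/. A quick, purely illustrative check (assuming shell access as the dask user on testslurmworker1) of whether that staging root exists and is writable on the worker node:

# Illustrative check, run on the worker node as the submitting user ("dask"):
# does the staging root from the gateway config exist and is it writable?
import os

staging_root = "/home/dask/.dask-gateway"
print("exists:", os.path.isdir(staging_root))
print("writable:", os.access(staging_root, os.W_OK))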