Open · menendes opened this issue 2 years ago
Can you get the log from the failed process? As far as I can tell, the printout only says that it terminated with a non-zero code.
Hi Martin, do you mean the slurmctld or slurmd log? Where exactly can I view job logs?
I'm afraid I don't know where such a log would appear, perhaps your sysadmin would know.
When I view the logs on the worker node, I notice some errors. The related log lines are below.
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: Launching batch job 67 for UID 1001
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: AcctGatherEnergy NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: AcctGatherProfile NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: AcctGatherInterconnect NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: AcctGatherFilesystem NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: switch NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: Job accounting gather LINUX plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: cont_id hasn't been set yet not running poll
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: laying out the 1 tasks on 1 hosts testslurmworker1 dist 2
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: Message thread started pid = 41666
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: task affinity plugin loaded with CPU mask 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: Checkpoint plugin loaded: checkpoint/none
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: Munge credential signature plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: job_container none plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: error: Could not open stdout file /home/dask/.dask-gateway/2428b456f82a44fdb3c8e57576662e8f/dask-scheduler-2428b456f82a44fdb3c8e57576662e8f.log: >
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: error: IO setup failed: No such file or directory
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: step_terminate_monitor_stop signaling condition
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: job 67 completed with slurm_rc = 0, job_rc = 256
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug: Message thread exited
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: done with job
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug: _rpc_terminate_job, uid = 64030
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug: task_p_slurmd_release_resources: affinity jobid 67
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug: credential for job 67 revoked
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug: Waiting for job 67's prolog to complete
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug: Finished wait for job 67's prolog to complete
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug: Calling /usr/sbin/slurmstepd spank epilog
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug: Running spank/epilog for jobid [67] uid [1001]
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug: spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug: /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug: completed epilog for jobid 67
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug: Job 67: sent epilog complete msg: rc = 0
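An editorial note on the log above: the job itself exits almost immediately because slurmstepd cannot open the scheduler's stdout file (`Could not open stdout file /home/dask/.dask-gateway/.../dask-scheduler-....log` followed by `IO setup failed: No such file or directory`), which suggests the dask-gateway staging directory does not exist (or is not writable) for that user on the worker node. In dask-gateway's jobqueue backend this path is configurable; the sketch below is a hedged example, assuming the `SlurmClusterConfig.staging_directory` traitlet and the example path `/shared/dask-staging` (not taken from this thread):

```python
# dask_gateway_config.py -- sketch, not a verified fix for this issue.
# Point the staging directory at a path that exists and is writable by the
# submitting user on every Slurm worker node (e.g. a shared filesystem).
# "/shared/dask-staging" is a hypothetical example path.
c.SlurmClusterConfig.staging_directory = "/shared/dask-staging/{username}/.dask-gateway/"
```

Alternatively, keeping the default (`{home}/.dask-gateway/`) works only if the user's home directory is present and writable on the worker nodes, so checking that `/home/dask` exists on `testslurmworker1` would be another way to confirm this hypothesis.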
What happened: When I try to create a cluster via dask-gateway, I get an error like the one below. Even when the cluster is created successfully, I think it gets stuck in pending status and then shuts itself down automatically. When I submit a job directly with a Slurm command like sbatch, I can verify that it runs successfully on the Slurm cluster, but when I try to create the job via dask-gateway it closes itself after a few seconds.
dask_gateway_config.py
scontrol show job output
Environment: