dask / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License

Unable to submit jobs to PBS queue #619

Open tirohia opened 11 months ago

tirohia commented 11 months ago

The following code runs to completion, but no job is ever submitted to the cluster's queue. It prints what appears to be a valid job script, but when I return to the terminal and run qstat -sxw, no job is listed.

Minimal Complete Verifiable Example:

from time import sleep  # needed by waffle() below

from dask_jobqueue import PBSCluster
from dask.distributed import Client
from joblib import parallel_backend, delayed, Parallel

cluster = PBSCluster(
    cores=8,
    processes=1,
    memory='24 GB',
    account='acc05',
    queue="normalbw"
)

client = Client(cluster)

def waffle(number):
    sleep(1)
    return number + 3

with parallel_backend('dask', scheduler_host=cluster):
    client.submit(waffle, 1000)

print(cluster.job_script())

client.close()
cluster.close()

I'm new to dask, and I haven't been able to figure out how important the scale step is. Most examples like this one that I've seen include a line along the lines of:

cluster.scale(10)

If I leave this line out, the code still runs. If I give it a fairly hefty task (the sklearn model refinement that is my goal), qstat shows nothing in the queue, while htop shows the job running on the login node rather than on the PBS cluster.

Adding this line before client = Client(cluster) causes the code to crash, printing the following once for every n in cluster.scale(n):

Task exception was never retrieved
future: <Task finished name='Task-31' coro=<_wrap_awaitable() done, defined at /home/569/bc0283/workspace/classifiers/src/exploration/methylation/rp2venv/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 32\nCommand:\nqsub /scratch/acc5/bc0283/tmp/tmprxlpmgjk.sh\nstdout:\n\nstderr:\nqsub: Error: The system doesn\'t support the use of "-l select". Please use "-l ncpus" and "-l mem" instead.\n\n')>
Traceback (most recent call last):                                                                                                                                                                                                                                                                                            
  File "/home/569/bc0283/workspace/classifiers/src/exploration/methylation/rp2venv/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable                                                                                                                                                     
    return await aw                                                                                                                                                                                                                                                                                                           
           ^^^^^^^^                                                                                                                                                                                                                                                                                                           
  File "/home/569/bc0283/workspace/classifiers/src/exploration/methylation/rp2venv/lib/python3.11/site-packages/distributed/deploy/spec.py", line 74, in _                                                                                                                                                                    
    await self.start()                                                                                                                                                                                                                                                                                                        
  File "/home/569/bc0283/workspace/classifiers/src/exploration/methylation/rp2venv/lib/python3.11/site-packages/dask_jobqueue/core.py", line 411, in start                                                                                                                                                                    
    out = await self._submit_job(fn)                                                                                                                                                                                                                                                                                          
          ^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                                                          
  File "/home/569/bc0283/workspace/classifiers/src/exploration/methylation/rp2venv/lib/python3.11/site-packages/dask_jobqueue/core.py", line 394, in _submit_job                                                                                                                                                              
    return self._call(shlex.split(self.submit_command) + [script_filename])                                                                                                                                                                                                                                                   
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                                                   
  File "/home/569/bc0283/workspace/classifiers/src/exploration/methylation/rp2venv/lib/python3.11/site-packages/dask_jobqueue/core.py", line 489, in _call                                                                                                                                                                    
    raise RuntimeError(                                                                                                                                                                                                                                                                                                       
RuntimeError: Command exited with non-zero exit code.                                                                                                                                                                                                                                                                         
Exit code: 32                                                                                                                                                                                                                                                                                                                 
Command:                                                                                                                                                                                                                                                                                                                      
qsub /scratch/acc5/bc0283/tmp/tmprxlpmgjk.sh                                                                                                                                                                                                                                                                                  
stdout:                                                                                                                                                                                                                                                                                                                       

stderr:                                                                                                                                                                                                                                                                                                                       
qsub: Error: The system doesn't support the use of "-l select". Please use "-l ncpus" and "-l mem" instead.                                                                                                                                                                                                                   

I don't know whether this is relevant, but in both cases (with or without the scale step), cluster.job_script() returns a line with:

#PBS -l select=1:ncpus=8:mem=24GB                                                                                                                                                                                                                                                                                             

I don't know why it's accepted without the scale step but rejected with it, or whether it's even relevant.

There is one issue with the same error message that I've found, but it went stale without a solution. I very definitely have sufficient resources to be able to submit the jobs though, so it's not that.

I'm running this on an HPC cluster with PBS. I'm loading the Python3 module, which gives me Python 3.11.5. At the suggestion of the HPC administrators, I'm running in a Python virtual environment rather than conda. Within that environment I installed dask with pip and have: dask==2023.12.0, dask-jobqueue==0.8.2

tirohia commented 11 months ago

If I'm reading things right, cluster.scale(n) is the first interaction with the PBS system. So it would make sense that there's no job submitted to the queue if it's omitted. Which clears up half the problem.

The other half being system not supporting the use of -l select.

guillaumeeb commented 11 months ago

> If I'm reading things right, cluster.scale(n) is the first interaction with the PBS system. So it would make sense that there's no job submitted to the queue if it's omitted. Which clears up half the problem.

Absolutely: if you don't scale your cluster, no job is submitted and no Dask workers are created.
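As a rough sketch of the order of operations (not a tested recipe for this particular system), the cluster is created first, then scaled, and only then used. The function name start_pbs_workers and the n_jobs default are illustrative. The resource values are copied from the example above. The sketch assumes the site's qsub accepts the generated job script, which is not the case here until the directive problem is solved.

```python
def start_pbs_workers(n_jobs=2):
    """Create a PBSCluster, ask PBS for n_jobs worker jobs, return (cluster, client).

    Hypothetical helper for illustration; the resource values come from the
    example in this thread and will differ on other systems.
    """
    # Imported here so the sketch can be loaded without dask-jobqueue installed.
    from dask.distributed import Client
    from dask_jobqueue import PBSCluster

    # Creating the cluster object does not submit anything to PBS.
    cluster = PBSCluster(
        cores=8,
        processes=1,
        memory="24 GB",
        account="acc05",
        queue="normalbw",
    )

    # scale() is the call that makes dask-jobqueue run qsub.
    cluster.scale(jobs=n_jobs)

    client = Client(cluster)
    # Block until at least one worker has connected, so work is not
    # submitted while the PBS jobs are still waiting in the queue.
    client.wait_for_workers(1)
    return cluster, client
```

Once this returns, qstat should list the worker jobs. If wait_for_workers hangs, the jobs are most likely still queued or failed at submission.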

> with parallel_backend('dask', scheduler_host=cluster):
>      client.submit(waffle, 1000)

This is odd: you don't want to mix the Future API and joblib. joblib should be used only with scikit-learn or other libraries that support it.
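To illustrate the distinction, here is a minimal sketch of the two routes kept separate. The helper names run_with_futures and run_with_joblib are made up for this example, and the sleep is shortened from the original. The joblib variant assumes a Dask Client has already been created in the same process, so the 'dask' backend can find it.

```python
from time import sleep


def waffle(number):
    sleep(0.1)
    return number + 3


def run_with_futures(client, numbers):
    """Future API only: submit each call to the client, then collect results."""
    futures = [client.submit(waffle, n) for n in numbers]
    return [future.result() for future in futures]


def run_with_joblib(numbers):
    """joblib only: let joblib dispatch the calls through the Dask backend.

    Assumes a dask.distributed Client already exists in this process.
    """
    # Imported here so the Future API half works without joblib installed.
    from joblib import Parallel, delayed, parallel_backend

    with parallel_backend("dask"):
        return Parallel(n_jobs=-1)(delayed(waffle)(n) for n in numbers)
```

For a scikit-learn workload, the joblib route is the relevant one: the fit call goes inside the parallel_backend block and no client.submit is needed.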

> The other half being system not supporting the use of -l select.

So your HPC system is not using the standard PBS way to select resources (or maybe this is the new standard in newer PBS versions?). In any case, the way to fix this is to customize the job script using the job_directives_skip and job_extra_directives kwargs.
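A sketch of what that could look like, assuming the site wants the "-l ncpus" and "-l mem" form named in the qsub error message. The exact directive strings are a guess and should be checked against the site's documentation. The names SKIP_DIRECTIVES, EXTRA_DIRECTIVES and make_cluster are illustrative only.

```python
# Header lines containing any of these substrings are left out of the
# generated job script; "select" targets the "#PBS -l select=..." line.
SKIP_DIRECTIVES = ["select"]

# Lines added to the job script header, each emitted as "#PBS <directive>".
# The values mirror cores=8 and memory='24 GB' from the original example;
# the accepted syntax is site-specific (assumption).
EXTRA_DIRECTIVES = ["-l ncpus=8", "-l mem=24GB"]


def make_cluster():
    """Build a PBSCluster whose job script avoids "-l select" (hypothetical helper)."""
    # Imported here so the sketch can be loaded without dask-jobqueue installed.
    from dask_jobqueue import PBSCluster

    return PBSCluster(
        cores=8,
        processes=1,
        memory="24 GB",
        account="acc05",
        queue="normalbw",
        job_directives_skip=SKIP_DIRECTIVES,
        job_extra_directives=EXTRA_DIRECTIVES,
    )
```

Printing cluster.job_script() before calling cluster.scale() is a cheap way to confirm that the select line is gone and the two replacement lines are present.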