tirohia opened 11 months ago
If I'm reading things right, cluster.scale(n) is the first interaction with the PBS system. So it would make sense that there's no job submitted to the queue if it's omitted. Which clears up half the problem.
The other half being the system not supporting the use of -l select.
> If I'm reading things right, cluster.scale(n) is the first interaction with the PBS system. So it would make sense that there's no job submitted to the queue if it's omitted. Which clears up half the problem.
Absolutely, no job will be created and no Dask workers if you don't scale your cluster.
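To make that concrete, here is a minimal sketch of the expected flow; this can't run outside a PBS system, and the resource values are placeholders, not taken from the original report:

```python
from dask_jobqueue import PBSCluster
from dask.distributed import Client

# Placeholder resources; adjust for your system/queue.
cluster = PBSCluster(cores=4, memory="16GB")

# Without this call, no PBS job is ever submitted and no workers start.
cluster.scale(jobs=2)  # qsub two worker jobs

client = Client(cluster)
```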
with parallel_backend('dask', scheduler_host=cluster):
    client.submit(waffle, 1000)
This is odd: you don't want to mix the Future API and joblib. joblib should only be used with scikit-learn or other libraries that support it.
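For reference, the plain Future-API path, with no joblib in the mix, is just `client.submit(waffle, 1000).result()`. The same shape sketched with the stdlib `concurrent.futures` analogue so it runs anywhere (`waffle` here is a stand-in for the user's function; with dask you would call `Client.submit` instead of `Executor.submit`):

```python
from concurrent.futures import ThreadPoolExecutor

def waffle(n):
    # Stand-in for the user's actual workload.
    return sum(range(n))

# dask's Client.submit mirrors Executor.submit: both return a Future.
with ThreadPoolExecutor() as ex:
    fut = ex.submit(waffle, 1000)
    print(fut.result())  # 499500
```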
> The other half being system not supporting the use of -l select.
So your HPC system is not using the standard PBS way to select resources (or maybe this is the new standard in newer PBS versions?). In any case, the way to fix this is to customize the job script using the job_directives_skip and job_extra_directives kwargs.
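In dask-jobqueue 0.8.x those kwargs are passed at cluster construction, e.g. PBSCluster(..., job_directives_skip=["-l select"], job_extra_directives=[...]) — the replacement directive below is an assumption; use whatever your site's qsub accepts. Conceptually the two kwargs just filter and extend the generated header, which can be sketched with a hypothetical helper (not the dask-jobqueue implementation):

```python
def render_header(base_directives, job_directives_skip=(), job_extra_directives=()):
    """Hypothetical sketch: drop header lines containing any skip pattern,
    then append the extra directives."""
    kept = [d for d in base_directives
            if not any(skip in d for skip in job_directives_skip)]
    return ["#PBS " + d for d in kept + list(job_extra_directives)]

header = render_header(
    ["-N dask-worker", "-l select=1:ncpus=4", "-l walltime=01:00:00"],
    job_directives_skip=["-l select"],             # drop the unsupported directive
    job_extra_directives=["-l nodes=1:ppn=4"],     # assumed site-specific replacement
)
print("\n".join(header))
```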
The following code runs to completion, but no job is ever submitted to the cluster's queue. It prints what appears to be a valid job script, but when I return to the terminal and run qstat -sxw, no job has been submitted.
Minimal Complete Verifiable Example:
I'm new to dask, and I haven't been able to figure out how important the scale step is. Most examples that I've seen, like this one, include a line along the lines of
If I don't include this line, the code runs. If I give it a fairly hefty task (the sklearn model refinement that is my goal), qstat shows nothing in the queue, but htop shows the job running, on the login node rather than on the PBS cluster.
Adding this line before client = Client(cluster) causes the code to crash, returning one iteration of this for every n in cluster.scale(n):
I don't know if this is relevant or not though. In both cases (with or without the scale step), cluster.job_script() returns a line with:
I don't know why it's accepted without the scaling, but not with. If it's even relevant.
I found one existing issue with the same error message, but it went stale without a solution. I very definitely have sufficient resources to submit the jobs, though, so it's not that.
I'm running this on an HPC cluster with PBS. I'm loading the Python3 module, which gives me Python 3.11.5. At the suggestion of the HPC administrators, I'm running in a Python venv rather than conda; within that env I installed (via pip) dask==2023.12.0 and dask-jobqueue==0.8.2.