SMART-Lab / smartdispatch

An easy-to-use job launcher for supercomputers with a PBS-compatible job manager.
Do What The F*ck You Want To Public License

Mismatch between PBS and moab job ID on some clusters #137

Closed slefrancois closed 8 years ago

slefrancois commented 8 years ago

On Calcul Québec's Helios cluster, the PBS job ID assigned by Torque and the Moab job ID do not match. Right now, smart-dispatch displays the job ID returned by qsub (from Moab) and writes it into job_id.txt, while the workers log the PBS_ID from Torque, which is available on all servers.

Smart-dispatch should output consistent job IDs. When running on clusters like Helios, that would require using the ID returned by qsub to find the PBS ID and display it.

MarcCote commented 8 years ago

Right. I've also observed that on Helios. The IDs contained in job_id.txt can't be used with the qdel command, which is annoying.

mgermain commented 8 years ago

I totally agree. Before that, we should investigate whether there is a direct link between certain tools and which job ID they use (msub vs. qsub, qstat, showq, etc.). This will tell us whether we should always report both IDs or "hide" the mismatch from the user and only report one.
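
To make the two reporting options concrete, here is a minimal sketch and not smart-dispatch's actual code. The helper names `job_sequence` and `describe_ids` are hypothetical, and so are the ID formats used with them. It assumes that two IDs refer to the same job when their first run of digits is equal, which may not hold on every cluster. On a worker, the PBS ID could come from, for example, Torque's `PBS_JOBID` environment variable.

```python
import re


def job_sequence(job_id):
    """Return the first run of digits in a job ID, or None if there is none."""
    match = re.search(r"\d+", job_id)
    return match.group(0) if match else None


def describe_ids(submitted_id, pbs_id):
    """Report a single ID when both agree, otherwise report both.

    submitted_id: the ID printed by the submission command (hypothetical format).
    pbs_id: the ID seen by the worker (hypothetical format).
    """
    submitted_id = submitted_id.strip()
    pbs_id = pbs_id.strip()
    seq_submitted = job_sequence(submitted_id)
    seq_pbs = job_sequence(pbs_id)
    # Assumption: equal sequence numbers mean the IDs name the same job.
    if seq_submitted is not None and seq_submitted == seq_pbs:
        return pbs_id
    return "{} (PBS: {})".format(submitted_id, pbs_id)
```

With this approach the mismatch stays hidden on clusters where the IDs agree, and both IDs are shown only where they differ.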

slefrancois commented 8 years ago

I did a quick check on Colosse, which also uses msub. There, qstat -f and the rest of the system output a single job ID in PBS style, as expected. I haven't found traces of a separate Moab ID.

So from my very representative two data points, I get the feeling it's a config quirk in Helios more than a feature of msub.

MarcCote commented 8 years ago

Can we close this one now that #139 is merged?