cBio / cbio-cluster

MSKCC cBio cluster documentation

Python submitted DRMAA jobs are not running on the worker nodes #246

Open vipints opened 9 years ago

vipints commented 9 years ago

Since yesterday I have been struggling to debug an error caused by running Python scripts via the drmaa module. My compute jobs are failing with Exit_status=127; here is one such event: tracejob -slm -n 2 3025874. The drmaa module is able to dispatch the job to the worker node with all necessary PATH variables, but it fails right after that (using only a single second). The log file didn't give much information:

```
-bash: module: line 1: syntax error: unexpected end of file
-bash: error importing function definition for `BASH_FUNC_module'
-bash: line 1: /var/spool/torque/mom_priv/jobs/3025874.mskcc-fe1.local.SC: No such file or directory
```

I am able to run this Python script without Torque on the login machine and on a worker node (with qlogin).

Has anybody used the drmaa/Python combination for cluster computing?

I checked the drmaa job environment; all PATH variables are loaded correctly. I am not sure why the worker node is kicking out my job.

I am not quite sure how to proceed with debugging or where to look. Any suggestions/help? :)
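
For context, the jobs are dispatched with the standard drmaa-python flow; a minimal sketch of that pattern is below (the command, script path, and resource values are placeholders, not the actual pipeline):

```python
import os
import drmaa

# Minimal sketch of the drmaa-python submission flow in use here.
# The command, script path, and resource values are placeholders only.
with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/usr/bin/python"
    jt.args = ["/path/to/compute_script.py"]        # hypothetical script
    jt.jobEnvironment = dict(os.environ)            # forward PATH etc. to the worker node
    jt.nativeSpecification = "-l mem=12gb -l walltime=40:00:00"

    job_id = session.runJob(jt)
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print("job %s finished with exit status %s" % (job_id, info.exitStatus))

    session.deleteJobTemplate(jt)
```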

tatarsky commented 9 years ago

That bash error looks suspiciously like what the bash patch for "shellshock" says when an improper function invocation is attempted.

tatarsky commented 9 years ago

BTW are you saying "this worked before yesterday" ???

vipints commented 9 years ago

Yes, these scripts were running perfectly until yesterday.

tatarsky commented 9 years ago

Did you perhaps "add a module" yesterday? As far as I can tell, that error has to do with the "modules" package.

vipints commented 9 years ago

No

tatarsky commented 9 years ago

Your .bashrc was modified as of this morning... what was changed?

tatarsky commented 9 years ago

Also attempting to reproduce....

vipints commented 9 years ago

I just deleted an empty line, that is all I remember...

tatarsky commented 9 years ago

Well I'll have to look around. No changes I can think of on the cluster except the epilog script which only fires if the queue is "active".

vipints commented 9 years ago

I didn't make any changes to the scripts. Thank you @tatarsky.

tatarsky commented 9 years ago

Do you know BTW what "the worker node" was in that message? Can dig around but if you already know it would be appreciated.

vipints commented 9 years ago

gpu-1-13

vipints commented 9 years ago

I asked for nodes=1 and ppn=4 and it dispatched:

exec_host=gpu-1-13/6+gpu-1-17/19+gpu-1-15/10+gpu-3-8/14

tatarsky commented 9 years ago

Yeah, I saw that. Does your script contain an attempt to use "module"? Or perhaps provide me the location of the item you run? That error is coming from the modules /etc/profile.d/modules.sh as far as I can tell, which is untouched, so I'm curious what's calling it.

vipints commented 9 years ago

sending an email with details.

tatarsky commented 9 years ago

Thanks!

tatarsky commented 9 years ago

Made title of this more specific for my tracking purposes.

tatarsky commented 9 years ago

Under some condition, the method DRMAA Python uses to submit jobs appears to get blocked from submitting more data. I have @vipints running again, but I am chasing down what the resolution was for this Torque mailing list discussion:

http://www.supercluster.org/pipermail/torqueusers/2014-January/016732.html

I do not believe the hotfix "introduced" this problem, as that discussion is old. Opening a ticket with Adaptive to inquire.

vipints commented 9 years ago

Hi @tatarsky, this morning I noticed that Python drmaa-submitted jobs are not dispatching to the worker nodes. I am not able to see a start time; for example: showstart 3122268

INFO: cannot determine start time for job 3122268

I don't know what is happening here.

tatarsky commented 9 years ago

I don't see the same issue as before.

Looks to me like a simple case of your jobs being rejected due to resources.

```
checkjob -v 3122268
Node Availability for Partition MSKCC --------

gpu-3-9                  rejected: Features
gpu-1-4                  rejected: Features
gpu-1-5                  rejected: Features
gpu-1-6                  rejected: Features
gpu-1-7                  rejected: HostList
gpu-1-8                  rejected: HostList
gpu-1-9                  rejected: HostList
gpu-1-10                 rejected: HostList
gpu-1-11                 rejected: HostList
gpu-1-12                 rejected: Features
gpu-1-13                 rejected: Features
gpu-1-14                 rejected: Features
gpu-1-15                 rejected: Features
gpu-1-16                 rejected: Features
gpu-1-17                 rejected: Features
gpu-2-4                  rejected: HostList
gpu-2-5                  rejected: HostList
gpu-2-6                  rejected: Features
gpu-2-7                  rejected: HostList
gpu-2-8                  rejected: Features
gpu-2-9                  rejected: Features
gpu-2-10                 rejected: HostList
gpu-2-11                 rejected: Features
gpu-2-12                 rejected: Features
gpu-2-13                 rejected: HostList
gpu-2-14                 rejected: Features
gpu-2-15                 rejected: Features
gpu-2-16                 rejected: Features
gpu-2-17                 rejected: Features
gpu-3-8                  rejected: Features
cpu-6-1                  rejected: Features
cpu-6-2                  rejected: HostList
NOTE:  job req cannot run in partition MSKCC (available procs do not meet requirements : 0 of 1 procs found)
idle procs: 608  feasible procs:   0

Node Rejection Summary: [Features: 21][HostList: 11]
```

vipints commented 9 years ago

Thanks @tatarsky, I saw this message but forgot to include it in my previous message. Not sure why it got rejected, as I am requesting limited resources: 12gb mem and 40hrs cput_time.

tatarsky commented 9 years ago

This is a little weird, perhaps a syntax error?

Features: cpu-6-2

So it seems to be asking for a hostname as a feature....

tatarsky commented 9 years ago

It's weird: if you look at the "required hostlist", cpu-6-2 does not appear in it, yet I see you requesting it.

```
Opsys: ---  Arch: ---  Features: cpu-6-2
Required HostList: [gpu-1-12:1][gpu-1-13:1][gpu-1-16:1][gpu-1-17:1][gpu-1-14:1][gpu-1-15:1]
  [cpu-6-1:1][gpu-3-8:1][gpu-3-9:1][gpu-1-4:1][gpu-1-5:1][gpu-1-6:1]
  [gpu-2-17:1][gpu-2-16:1][gpu-2-15:1][gpu-2-14:1][gpu-2-12:1][gpu-2-11:1]
  [gpu-2-6:1][gpu-2-9:1][gpu-2-8:1]
```

tatarsky commented 9 years ago

From the queue file...

```
<submit_args flags="1"> -N pj_41d1c2f4-e8c0-11e4-97d2-5fd54d3e274e -l mem=12gb -l vmem=12gb -l pmem=12gb -l pvmem=12gb
-l nodes=1:ppn=1 -l walltime=40:00:00 -l host=gpu-1-12+gpu-1-13+gpu-1-16+gpu-1-17+gpu-1-14+gpu-1-15+cpu-6-2+cpu-6-1+gpu-3-8
+gpu-3-9+gpu-1-4+gpu-1-5+gpu-1-6+gpu-2-17+gpu-2-16+gpu-2-15+gpu-2-14+gpu-2-12+gpu-2-11+gpu-2-6+gpu-2-9+gpu-2-8</submit_args>
```

vipints commented 9 years ago

Yes, correct, I am requesting specific hostnames in my submission arguments. Due to the OOM issue I have blacklisted the following nodes: ['gpu-1-10', 'gpu-1-9', 'gpu-1-8', 'gpu-1-11', 'gpu-1-7', 'gpu-2-5', 'gpu-2-13', 'gpu-2-7', 'gpu-2-4', 'gpu-2-10'].

tatarsky commented 9 years ago

Try the submit without the blacklist. The OOM issue is not node related. I continue to work on the best solution to it.

vipints commented 9 years ago

Sure, trying now. Thank you!

tatarsky commented 9 years ago

To be clear, this is to try to figure out why what looks to me like a valid "host=" stanza got interpreted as a feature request....

vipints commented 9 years ago

OK, when drmaa includes the -l host argument in the job submission, the job stays in the queue. I removed the host argument from the submission, and with that the jobs are submitted and finish successfully.
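
For reference, a minimal sketch of that change in terms of the drmaa-python nativeSpecification (script path, resource values, and host names here are illustrative only, not the actual submission):

```python
import drmaa

# Sketch only; script path, resource values, and host names are illustrative.
with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/usr/bin/python"
    jt.args = ["/path/to/compute_script.py"]    # hypothetical script

    # Variant that stalled in the queue: pinning hosts via "-l host=...".
    # jt.nativeSpecification = ("-l mem=12gb -l nodes=1:ppn=1 -l walltime=40:00:00 "
    #                           "-l host=gpu-1-12+gpu-1-13")

    # Variant that ran: same resources, no host list.
    jt.nativeSpecification = "-l mem=12gb -l nodes=1:ppn=1 -l walltime=40:00:00"

    job_id = session.runJob(jt)
    print("submitted job %s" % job_id)
    session.deleteJobTemplate(jt)
```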

tatarsky commented 9 years ago

So I suspect it's parsing it wrong somewhere, but I have no clue where. Noted for the next dive into the code for now. Working to remove the reasons you are using the hostlist in #241.

vipints commented 9 years ago

thanks! I am just clearing the queued jobs.

vipints commented 9 years ago

checking the native specification of drmaa

tatarsky commented 9 years ago

I forgot how we were doing on this. No obvious config issues have been unearthed yet.

vipints commented 9 years ago

Hi @tatarsky, after restarting pbs_server there were no issues submitting jobs via Python drmaa. I tried submitting a large number of jobs and they were successful.

A few days back I got an error message drmaa.errors.InternalException: code 1: (qsub) cannot access script file: but when I tried again after 10 minutes it was working smoothly. There were not many details associated with this error.

tatarsky commented 9 years ago

OK. I'm leaving this open for a bit in case I am able to reproduce whatever happened.

vipints commented 9 years ago

Hi @tatarsky, I think DRMAA Python has reached the limit on the number of jobs submitted via the drmaa qsub wrapper. I am not able to submit more jobs to the worker nodes. I think the pbs_server restart resolved this problem last time. Can you help me here?

Thank you!

tatarsky commented 9 years ago

Reproduced with my simple test. Restarted. Collected logs to add to ticket.
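
The simple test itself isn't shown in the thread; for anyone following along, a loop of roughly this shape (entirely hypothetical, using a trivial sleep job) is one way to exercise repeated DRMAA submissions against pbs_server:

```python
import drmaa
from drmaa.errors import DrmaaException

# Hypothetical stress test: push many trivial jobs through one DRMAA session
# to see whether pbs_server eventually stops accepting submissions.
with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/bin/sleep"
    jt.args = ["1"]

    for i in range(500):                     # arbitrary count
        try:
            job_id = session.runJob(jt)
            print("%d submitted %s" % (i, job_id))
        except DrmaaException as exc:
            print("%d submission failed: %s" % (i, exc))
            break

    session.deleteJobTemplate(jt)
```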

vipints commented 9 years ago

Thank you very much @tatarsky!

cganote commented 9 years ago

Just my 2 cents - I struggle with this issue as well, at least it sounds similar. We have to restart pbs_server every day or so because we get too many jobs submitted and then pbs doesn't respond. On restart, I get a bunch of messages that the jobs were in a strange substate. Let me know if I can help. This also happens using pbs_python, though the error messages are different.

tatarsky commented 9 years ago

Yeah I actually saw your discussion on this I believe. When I return to looking at this I will let you know if I get anywhere.

http://dev.list.galaxyproject.org/Errors-running-DRMAA-and-PBS-on-remote-server-running-Torque-4-td4662169.html

vipints commented 9 years ago

Hi @tatarsky, I may need your help to restart pbs_server, as I have reached the limit (on the number of jobs submitted via Python drmaa) that is imposed by the drmaa qsub wrapper.

tatarsky commented 9 years ago

I've restarted it. I've not had time to delve into this.

vipints commented 9 years ago

thanks @tatarsky.

vipints commented 9 years ago

Hi @tatarsky, since late this evening I am getting a job exit status 127 handled error message for my Python drmaa-submitted jobs. Job IDs are 3387611 - 3387680; they dispatch to the compute node and stall the next second. I double-checked the paths used by the program and they all look good. Any idea what's going wrong here? Thanks.

tatarsky commented 9 years ago

Restarted pbs_server. See if that's the issue again. But this isn't something I'm going to support like this going forward. It's obviously got problems.

On Monday I'd like a call to discuss how this software will be handled if we cannot find a fix for this. And how many hours we want to devote to finding that fix.

vipints commented 9 years ago

Thanks @tatarsky, I just restarted my jobs and they are running now.

I agree with your statement and I am going to look at the codebase tomorrow; talk to you then.

tatarsky commented 9 years ago

Today has had many issues and I have to cycle on a few more still. Perhaps sometime tomorrow we could review the plan of attack for this python.

tatarsky commented 9 years ago

Possible clue from config.log of our pbs_drmaa:

```
configure:16224: checking for pbs_submit_hash in -ltorque
/home/paul/pbs-drmaa-1.0.18/conftest.c:34: undefined reference to `pbs_submit_hash'
```

The discussion I referenced a few comments back seems to state that this all has something to do with that function.

tatarsky commented 9 years ago

Another clue:

http://www.supercluster.org/pipermail/torqueusers/2013-November/016514.html

That's the author stating, I believe, that the version before the one we used handles this properly but ifdef'd. I wonder if it's a "4" vs. "5" sort of thing.

vipints commented 9 years ago

Hi @tatarsky, I believe this is a "4" vs "5" issue.