cBio / cbio-cluster

MSKCC cBio cluster documentation

Python submitted DRMAA jobs are not running on the worker nodes #246

Open vipints opened 9 years ago

vipints commented 9 years ago

Since yesterday I have been struggling to debug an error caused by running Python scripts via the drmaa module. My compute jobs are failing with Exit_status=127; here is one such event: tracejob -slm -n 2 3025874. The drmaa module is able to dispatch the job to the worker node with all necessary PATH variables, but it fails immediately after that (using only a single second). The log file didn't give much information:

-bash: module: line 1: syntax error: unexpected end of file
-bash: error importing function definition for `BASH_FUNC_module'
-bash: line 1: /var/spool/torque/mom_priv/jobs/3025874.mskcc-fe1.local.SC: No such file or directory

I am able to run this Python script without Torque on the login machine and on a worker node (via qlogin).

Has anybody used the drmaa/Python combination for cluster computing?

I checked the drmaa job environment; all PATH variables are loaded correctly. I am not sure why the worker node is kicking out my job.

I am not quite sure how to proceed with the debugging or where to look; any suggestions/help would be appreciated :)
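
For reference, a minimal sketch of the submission path being described, assuming the standard drmaa-python bindings; the /usr/bin/env command and the output path are illustrative choices rather than anything from the original scripts, but dumping the environment the worker-node shell actually receives helps rule out PATH problems like the one above.

# Minimal sketch, assuming the standard drmaa-python bindings.
# /usr/bin/env and the output path are illustrative, not from the
# original scripts; exitStatus 127 means the shell never found or
# ran the command on the worker node.
import drmaa

s = drmaa.Session()
s.initialize()

jt = s.createJobTemplate()
jt.remoteCommand = '/usr/bin/env'            # dump the environment the job sees
jt.outputPath = ':/tmp/drmaa_env_check.out'  # hypothetical output location
jt.joinFiles = True                          # merge stderr into stdout

jobid = s.runJob(jt)
retval = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
print('exit status:', retval.exitStatus)

s.deleteJobTemplate(jt)
s.exit()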

tatarsky commented 8 years ago

That's exciting, because I believe from some threads the limit was 1024. Let's declare victory at 10K ;)

tatarsky commented 8 years ago

So what do you think the count is at?

vipints commented 8 years ago

So far I have reached 4287.

tatarsky commented 8 years ago

Very cool. I'll ask again in 14 days, which is my guesstimate for reaching 10K. While it seems likely that was the fix, let's let it ride some more.

vipints commented 8 years ago

@tatarsky: as of today I have reached a total of 5976 finished jobs, but now I am triggering the error message max_num_job_reached from drmaa. Seems like it is not happy with the patch... There is a new version, pbs-drmaa-1.0.19, available; I am just comparing the changes against the previous one we are using.
@cganote: just checking, are your drmaa jobs OK with the patch?
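
One mitigation worth trying (a sketch only; whether eager cleanup keeps pbs-drmaa below whatever triggers max_num_job_reached is an assumption, not something confirmed here) is to release job templates right after submission and dispose of finished job records so the session's bookkeeping stays small.

# Sketch: free job templates immediately and dispose of job records once
# the batch finishes.  run_batch and the hostname commands are
# illustrative names, not from this thread.
import drmaa

def run_batch(commands):
    s = drmaa.Session()
    s.initialize()
    try:
        jobids = []
        for cmd in commands:
            jt = s.createJobTemplate()
            jt.remoteCommand = cmd
            jobids.append(s.runJob(jt))
            s.deleteJobTemplate(jt)   # release the template immediately
        # dispose=True asks the library to discard job records after waiting
        s.synchronize(jobids, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
    finally:
        s.exit()

run_batch(['/bin/hostname'] * 5)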

cganote commented 8 years ago

I haven't seen any issues, but maybe I'm not getting enough jobs submitted through drmaa? I certainly haven't had 6000 yet.

-Carrie

vipints commented 8 years ago

Thanks @cganote.

tatarsky commented 8 years ago

That's a rather odd number. I'd like to poke around for a bit before I restart pbs_server, to see if I learn anything new. Are you under time pressure to get more of these in?

vipints commented 8 years ago

If you can find some time this evening, that would be great. Thanks!

tatarsky commented 8 years ago

Similar code 15007 response in the logs, claiming "unauthorized request".

tatarsky commented 8 years ago

No new information gained. Restarted pbs_server.

vipints commented 8 years ago

@tatarsky, this time drmaa reached the max_num_jobs limit after just 366 job requests.

vipints commented 8 years ago

Seems like odd behavior this time.

vipints commented 8 years ago

Whenever you have a small time window, I need another restart of pbs_server. Thank you.

tatarsky commented 8 years ago

Restarted. I have this slated for possible test attempts on the new scheduler head I've built. The current issue is that I need some nodes to test that system with, and we're working on a schedule. It seems that the patch does not solve the problem, but it's unclear whether it hurts or helps overall. It seems weird that this one didn't even get to the "normal" 1024 or so.

vipints commented 8 years ago

It could be that someone else is also using drmaa to submit jobs to the cluster; the count of 366 jobs is just from my side.

Is there anybody else using the drmaa/Python combination to submit jobs on hal?

tatarsky commented 8 years ago

Not that I've ever heard of.

tatarsky commented 8 years ago

This world's longest issue may be further attacked via #349. However, it's unclear how it would be attacked at this moment in time.

raylim commented 8 years ago

Has there been any progress on this issue? Just encountered it today.

$ ipython
In [1]: import drmaa
In [2]: s = drmaa.Session()
In [3]: s.initialize()
In [4]: jt = s.createJobTemplate()
In [5]: jt.remoteCommand = 'hostname'
In [6]: jobid = s.runJob(jt)
In [7]: retval = s.wait(jobid)
In [8]: retval
Out[8]: JobInfo(jobId=u'7551501.hal-sched1.local', hasExited=False, hasSignal=False, terminatedSignal=u'unknown signal?!', hasCoreDump=False, wasAborted=True, exitStatus=127, resourceUsage={u'mem': u'0', u'start_time': u'1467389618', u'queue': u'batch', u'vmem': u'0', u'hosts': u'gpu-1-4/4', u'end_time': u'1467389619', u'submission_time': u'1467389616', u'cpu': u'0', u'walltime': u'0'})
In [9]: retval.exitStatus
Out[9]: 127
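
For what it's worth, a small sketch (not from the thread) of how that JobInfo can be classified programmatically; wasAborted=True together with exitStatus=127 matches the original failure mode, where the job script never ran on the node.

def job_failed_at_startup(info):
    """Return True if a DRMAA JobInfo indicates the payload never ran."""
    return info.wasAborted or (info.hasExited and info.exitStatus == 127)

# e.g. job_failed_at_startup(retval) on the JobInfo shown above returns True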

tatarsky commented 8 years ago

No. No precise solution has ever been found. I will restart pbs_server and you can tell me if it works after that. Then the item will be moved to Fogbugz.

vipints commented 8 years ago

I am not sure whether we ever found a way to fix this. If you are getting that error, it means drmaa has reached max_num_of_jobs; to fix it you may need a pbs_server restart from the admins to clear the job ids.
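
A sketch of how a submitting script could at least fail loudly when this happens, assuming the limit surfaces as a DRMAA error whose message mentions max_num_job (the exact exception class pbs-drmaa raises here is not confirmed in this thread; submit_or_explain is a hypothetical helper name):

from drmaa.errors import DrmaaException

def submit_or_explain(session, template):
    try:
        return session.runJob(template)
    except DrmaaException as exc:
        if 'max_num_job' in str(exc):
            raise SystemExit('pbs-drmaa job-id table appears full; '
                             'ask the admins for a pbs_server restart') from exc
        raise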

tatarsky commented 8 years ago

Server restarted to confirm your example is a case of this. If so, open a ticket in FogBugz via the email address listed in the /etc/motd on hal. I won't be processing items here further.

vipints commented 8 years ago

Sorry I meant to report via email to the cbio-admin group. Thanks @tatarsky!

tatarsky commented 8 years ago

That's fine. This ticket has a long, gory history. But all further attempts to figure it out require involvement by the primary support, which as of today is MSKCC staff. I will assist them as needed, but I don't feel this is likely to be trivially fixed. As we both know, DRMAA is quite a hack for Torque.

tatarsky commented 8 years ago

I do notice that since we last battled this there has been another release of pbs-drmaa: 1.0.19.

Perhaps by some Friday miracle it uses the Torque 5.0 submit call instead of the crufty 4.0 one that seems to be buggy.

vipints commented 8 years ago

Yeah, that is correct; it seems like they have support for v5. Maybe we can try it after the long weekend. I haven't checked the recent release version yet.

tatarsky commented 8 years ago

I see we actually noticed it when it came out last year. I see nothing overly "Torque 5" in it yet.

I am unlikely to look at this further today. Confirm/deny that your example now works with pbs_server restarted and open a ticket for some work next week.

raylim commented 8 years ago

Yes, Python drmaa job submission works now.

tatarsky commented 8 years ago

Kick an email to the address listed for problem reports in /etc/motd (sorry I'm not placing it again in the public Git) to start tracking it there. We'll reference this Git thread but we no longer process bugs here.

tatarsky commented 8 years ago

Not that this one is likely to be fixable anytime soon. We've tried for many years and DRMAA is basically not well supported by Adaptive.