Open vipints opened 9 years ago
That's exciting, because I believe from some threads the limit was 1024. Let's declare victory at 10K ;)
So what do you think the count is at?
So far I have reached 4287.
Very cool. I'll ask again in 14 days, which is my guesstimate for reaching 10K. While it seems likely that was the fix, let's let it ride some more.
@tatarsky: as of today I have reached a total of 5976 finished jobs, but now I am triggering the error message max_num_job_reached from drmaa. Seems like it is not happy with the patch... There is a new version, pbs-drmaa-1.0.19, available; I am just comparing the changes against the previous one we are using.
@cganote: just checking, are your drmaa jobs OK with the patch?
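Until the underlying limit is fixed, the max_num_job_reached condition can at least be handled on the client side by retrying submission after a pause. Below is a minimal sketch; `submit` stands in for a real drmaa call such as `s.runJob(jt)`, and the error-text matching, retry policy, and `fake_submit` demo are assumptions for illustration, not pbs-drmaa API:

```python
# Client-side workaround sketch for the max_num_job_reached error discussed
# above. `submit` stands in for a real drmaa call such as s.runJob(jt); the
# error-text matching and retry policy are assumptions, not pbs-drmaa API.
import time

def submit_with_retry(submit, retries=3, delay=0.0):
    """Call submit(); if it fails with a max-jobs error, wait and try again."""
    last_err = None
    for _ in range(retries):
        try:
            return submit()
        except RuntimeError as err:  # real code would catch drmaa.errors.DrmaaException
            if "max_num_job" not in str(err):
                raise  # unrelated failure: propagate immediately
            last_err = err
            time.sleep(delay)  # in practice: wait for an admin pbs_server restart
    raise last_err

# Toy demonstration: fail twice with the max-jobs message, then succeed.
attempts = {"n": 0}
def fake_submit():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("max_num_job_reached")
    return "12345.hal-sched1.local"

print(submit_with_retry(fake_submit))  # 12345.hal-sched1.local
```

This only papers over the problem, of course: once the server-side job-id count is exhausted, no amount of retrying helps until pbs_server is restarted.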
I haven't seen any issues, but maybe I'm not getting enough jobs submitted through drmaa? I certainly haven't had 6000 yet.
-Carrie
Thanks @cganote.
That's a rather odd number. I'd like to poke around for a bit before I restart pbs_server, to see if I learn anything new. Are you under time pressure to get more of these in?
If you find some time this evening, I will be happy. Thanks!
Similar code 15007 response in logs claiming "unauthorized request"
No new information gained. Restarted pbs_server.
@tatarsky, this time drmaa reached the max_num_jobs limit after just 366 job requests. That seems like odd behavior this time.
Whenever you have a small time window, I may need a restart of pbs_server. Thank you.
Restarted. I have this slated for possible test attempts on the new scheduler head I've built. The current issue is that I need some nodes to test that system with, and we're working on a schedule. It seems the patch does not solve the problem, but it's unclear whether it hurts overall or helps. Seems weird this one didn't even get to the "normal" 1024 or so.
It could be that someone else is also using drmaa to submit jobs to the cluster; the count (366) is just from my side.
Is anybody else using the drmaa/python combination to submit jobs on hal?
Not that I've ever heard of.
This, the world's longest issue, may be further attacked via #349. However, it's unclear how it would be attacked at this moment in time.
Has there been any progress on this issue? Just encountered it today.
$ ipython
In [1]: import drmaa
In [2]: s = drmaa.Session()
In [3]: s.initialize()
In [4]: jt = s.createJobTemplate()
In [5]: jt.remoteCommand = 'hostname'
In [6]: jobid = s.runJob(jt)
In [7]: retval = s.wait(jobid)
In [8]: retval
Out[8]: JobInfo(jobId=u'7551501.hal-sched1.local', hasExited=False, hasSignal=False, terminatedSignal=u'unknown signal?!', hasCoreDump=False, wasAborted=True, exitStatus=127, resourceUsage={u'mem': u'0', u'start_time': u'1467389618', u'queue': u'batch', u'vmem': u'0', u'hosts': u'gpu-1-4/4', u'end_time': u'1467389619', u'submission_time': u'1467389616', u'cpu': u'0', u'walltime': u'0'})
In [9]: retval.exitStatus
Out[9]: 127
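Incidentally, the JobInfo fields printed above already distinguish this failure from an ordinary non-zero exit: `wasAborted=True` together with `exitStatus=127` means the job script never actually ran on the worker node. A small sketch of reading those fields; `describe_outcome` is a hypothetical helper, and the namedtuple merely mimics the shape of python-drmaa's JobInfo:

```python
# Sketch interpreting the JobInfo fields shown above. The field names match
# python-drmaa's JobInfo; describe_outcome is a hypothetical helper, not a
# drmaa API.
from collections import namedtuple

JobInfo = namedtuple("JobInfo", "jobId hasExited hasSignal wasAborted exitStatus")

def describe_outcome(info):
    if info.wasAborted:
        # wasAborted=True with exitStatus=127 is exactly the failure in this
        # thread: the job was aborted before the script ever ran.
        return "aborted before running (exit status %d)" % info.exitStatus
    if info.hasSignal:
        return "killed by a signal"
    if info.hasExited and info.exitStatus == 0:
        return "finished normally"
    return "exited with status %d" % info.exitStatus

info = JobInfo("7551501.hal-sched1.local", False, False, True, 127)
print(describe_outcome(info))  # aborted before running (exit status 127)
```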
No. No precise solution has ever been found. I will restart pbs_server and you can tell me if it works after that. Then the item will be moved to FogBugz.
I am not sure whether we found a way to fix this. If you are getting the error, it means drmaa reached max_num_jobs; to fix this you need a pbs_server restart from the admins to clear the job ids.
Server restarted to confirm your example is a case of this. If so, open a ticket in FogBugz via the email address listed in the /etc/motd on hal. I won't be processing items here further.
Sorry I meant to report via email to the cbio-admin group. Thanks @tatarsky!
That's fine. This ticket has a long, long gory history. But all further attempts to figure it out require involvement by the primary support, which as of today is MSKCC staff. I will assist as needed, but I don't feel this is likely to be trivially fixed. As we both know, DRMAA is quite a hack for Torque.
I do notice since last we battled this there is another release of pbs-drmaa. 1.0.19.
Perhaps by some Friday miracle they are using the Torque 5.0 submit call instead of the crufty 4.0 one that seems to be buggy.
Yeah, that is correct. It seems they have support for v5. Maybe we can try after the long weekend. I hadn't checked the recent release version.
I see we actually noticed it when it came out last year. I see nothing overly "Torque 5" in it yet.
I am unlikely to look at this further today. Confirm/deny that your example now works with pbs_server restarted and open a ticket for some work next week.
Yes, python drmaa job submission works now.
Kick an email to the address listed for problem reports in /etc/motd (sorry I'm not placing it again in the public Git) to start tracking it there. We'll reference this Git thread but we no longer process bugs here.
Not that this one is likely to be fixable anytime soon. We've tried for many years and DRMAA is basically not well supported by Adaptive.
Since yesterday, I have been struggling to debug an error caused by running python scripts via the drmaa module. My compute jobs are failing with Exit_status=127; here is one such event: tracejob -slm -n 2 3025874. The drmaa module is able to dispatch the job to the worker node with all necessary PATH variables, but it fails just after that (using only a single second). The log file didn't give much information:
-bash: module: line 1: syntax error: unexpected end of file
-bash: error importing function definition for `BASH_FUNC_module'
-bash: line 1: /var/spool/torque/mom_priv/jobs/3025874.mskcc-fe1.local.SC: No such file or directory
I am able to run this python script without Torque on the login machine and on a worker node (with qlogin).
Has anybody used the drmaa/python combination in cluster computing?
I checked the drmaa job environment; all PATH env variables are loaded correctly. I am not sure why the worker node is kicking out my job.
I am not quite sure how to proceed with the debugging or where to look; any suggestions/help welcome :)
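One way to make the "I checked the drmaa job environment" step repeatable is to diff environment snapshots taken on the login node against snapshots taken inside a drmaa job. A toy sketch; `env_diff` and the sample dictionaries are illustrative, not from this thread:

```python
# Toy sketch of the environment check mentioned above: diff the PATH-style
# variables seen on the login node against those seen inside a drmaa job.
# env_diff and the sample snapshots are illustrative, not from the thread.
def env_diff(login_env, job_env, keys=("PATH", "LD_LIBRARY_PATH", "PYTHONPATH")):
    """Return {key: (login_value, job_value)} for the keys whose values differ."""
    return {k: (login_env.get(k), job_env.get(k))
            for k in keys
            if login_env.get(k) != job_env.get(k)}

# In real use the snapshots would come from os.environ on the login node and
# from a job on the worker node that prints its own os.environ.
login_env = {"PATH": "/usr/local/bin:/usr/bin", "PYTHONPATH": "/home/user/lib"}
job_env   = {"PATH": "/usr/local/bin:/usr/bin", "PYTHONPATH": ""}
print(env_diff(login_env, job_env))  # {'PYTHONPATH': ('/home/user/lib', '')}
```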