cBio / cbio-cluster

MSKCC cBio cluster documentation

Python submitted DRMAA jobs are not running on the worker nodes #246

Open vipints opened 9 years ago

vipints commented 9 years ago

Since yesterday, I have been struggling to debug an error caused by running python scripts via the drmaa module. My compute jobs are failing with Exit_status=127; here is one such event: tracejob -slm -n 2 3025874. The drmaa module is able to dispatch the job to the worker node with all the necessary PATH variables, but it fails just after that (using only a single second). The log file didn't give much information.

-bash: module: line 1: syntax error: unexpected end of file
-bash: error importing function definition for `BASH_FUNC_module'
-bash: line 1: /var/spool/torque/mom_priv/jobs/3025874.mskcc-fe1.local.SC: No such file or directory

I am able to run this python script without Torque on the login machine and a worker node (with qlogin).

Has anybody used the drmaa + python combination in cluster computing?

I checked the drmaa job environment; all PATH environment variables are loaded correctly. I am not sure why the worker node is kicking out my job.

I am not quite sure how to proceed with the debugging or where to look; any suggestions or help would be appreciated :)

tatarsky commented 9 years ago

I have some additional findings I'll put up here shortly. Torque 5 does appear to support pbs_submit_hash, but when the configure script runs its conftest to detect that, the test fails for reasons that smell suspiciously like another name-mangling item similar to previous battles.

I am attempting to convince the 3rd-party python bindings that the library call DOES exist, but I have been busy with the GPU matter.

vipints commented 9 years ago

Sure. I was searching through the torque codebase to understand the job submission handling: torque-5.0.1-1_4fa836f5/src/cmds/qsub_functions.c uses pbs_submit_hash to submit the jobs, while the drmaa part uses pbs_submit to manage the jobs. I remember we compiled the shared object libdrmaa.so against the Torque 5 distribution; just recalling the events.
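
To make the difference concrete, here is a rough sketch of the two call shapes (the pbs_submit line mirrors the classic IFL API declared in pbs_ifl.h; the pbs_submit_hash line follows the gdb backtrace quoted further down in this thread; all variable names are illustrative, not taken from either code base):

    /* Classic IFL submit, as used by libdrmaa's submit.c: pbs_submit()
     * returns the new job id as a string, or NULL on error. */
    pbs_job_id = pbs_submit(connection, attribs, script_path, NULL, NULL);

    /* Hash-based submit used by qsub in Torque 5: the job id comes back
     * through an out-parameter, and the attributes are passed as two
     * separate (job/resource) containers.  Treat this as a sketch, not the
     * verified prototype. */
    rc = pbs_submit_hash(connection, job_attr, res_attr, script_path,
                         destination, NULL, &return_jobid, &msg);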

tatarsky commented 9 years ago

Not if the drmaa code's configure detects pbs_submit_hash... which for odd reasons it does not.

The define it sets is:

HAVE_PBS_SUBMIT_HASH

I have isolated its conftest and it indeed fails. See hashtest.c in my homedir.

Note the possible involvement of yet another extern "C" issue....
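
For context, the configure test for pbs_submit_hash boils down to compiling and linking a tiny C program shaped roughly like the sketch below (a reconstruction of the usual autoconf AC_CHECK_LIB conftest, not the literal generated file). It declares the symbol with plain C linkage, so if libtorque exports the function under a C++-mangled name, the link step fails even though the function is sitting right there in the library:

    /* Approximate shape of autoconf's conftest for
     * AC_CHECK_LIB([torque], [pbs_submit_hash]): declare the symbol with C
     * linkage and try to link against -ltorque.  If the library only exports
     * a C++-mangled name, this link fails, configure reports "no", and
     * HAVE_PBS_SUBMIT_HASH is left undefined. */
    char pbs_submit_hash ();

    int main (void)
    {
        return pbs_submit_hash ();
    }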

tatarsky commented 9 years ago

Oh, I see more of what you are saying, BTW, from the libdrmaa compiles:

submit.c: pbs_job_id = pbs_submit(c->pbs_conn, sc->pbs_attribs, sc->script_filename, NULL, NULL);

tatarsky commented 9 years ago

And I can confirm qsub is definitely using pbs_submit_hash... via a gdb session on a foo.sh submission, breaking on that function. So it's in Torque 5... but I suspect libdrmaa isn't coded to use it. Assuming, BTW, that this is the actual issue....

Breakpoint 1, pbs_submit_hash (socket=1, job_attr=0x628170, res_attr=0x628380, script=0x7fffffffba80 "/tmp/qsub.fjSwH1", destination=0x630fd0 "", extend=0x0, 
    return_jobid=0x616ce8, msg=0x7fffffffdc10) at ../Libifl/pbsD_submit_hash.c:109
109     ../Libifl/pbsD_submit_hash.c: No such file or directory.
        in ../Libifl/pbsD_submit_hash.c

tatarsky commented 9 years ago

Another person has hit the same issue; here are their notes:

https://oss.trac.surfsara.nl/pbs_python/ticket/54

vipints commented 9 years ago

Exactly, I believe so. Looking through submit.c.

tatarsky commented 9 years ago

So I think we need a "new" submit.c that uses the hash function.

tatarsky commented 9 years ago

And possibly some wrapper changes on the python side.....

tatarsky commented 9 years ago

A more recent torque mailing list discussion from the same person:

http://www.clusterresources.com/pipermail/torqueusers/2015-March/017966.html

tatarsky commented 9 years ago

OK. Pausing for a moment here.

I see two paths, which we can pursue with Adaptive or, in the latter case, via coding:

  1. We determine why pbs_submit appears to "tap out" after some N submits, and find a way to address that other than restarting the pbs_server.
  2. We hack support for pbs_submit_hash into (I believe) libdrmaa, and fix some glue somewhere between the python and that library. We appear not to be alone in that effort.

vipints commented 9 years ago

Alright, a good plan on this issue now. I am looking into point 2): a way to update the submit.c code.

vipints commented 9 years ago

I may need a pbs_server restart to finish a few jobs; I think I crossed the N number of jobs.

tatarsky commented 9 years ago

Yeah give me a bit to see if I can gain any additional insight into things before I restart. Trying to get a solid explanation of this to Adaptive.

vipints commented 9 years ago

Sure, not a problem. I am trying to rewrite a few lines in submit.c.

tatarsky commented 9 years ago

Nothing new learned. Restarted it.

tatarsky commented 9 years ago

We need to chat. You are trying to fix the library we didn't use... because it was a big mess of C++ vs. C. What's a good number to call? I don't want you wasting time fixing what I am quite certain isn't used. Email is fine.

tatarsky commented 9 years ago

The libdrmaa.so we ended up getting to work is from these guys.

http://sourceforge.net/projects/pbspro-drmaa/files/pbs-drmaa/1.0/ And that's what's in /opt/torque/lib/libdrmaa.so. We are running the latest version, 1.0.18.

So whatever call it uses gets wrapped by the drmaa-python code.

And it's got a complicated set of ifdefs that TRY to support the submit_hash, but it's not working. It's using pbs_submit.

So if we hack on something, it's that library, not the one that came with Torque, which BTW is an ANCIENT version of the above.
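
To illustrate how that plays out (a simplified sketch, not the actual pbs-drmaa source; the hash-branch variable names are hypothetical): when configure never defines HAVE_PBS_SUBMIT_HASH, only the pbs_submit branch gets compiled in, which matches the behavior we are seeing:

    #ifdef HAVE_PBS_SUBMIT_HASH
      /* Torque >= 5 path; never compiled here, because configure's link test
       * for pbs_submit_hash fails and HAVE_PBS_SUBMIT_HASH stays undefined.
       * job_attr, res_attr, msg and rc are illustrative names only. */
      rc = pbs_submit_hash(c->pbs_conn, job_attr, res_attr,
                           sc->script_filename, NULL, NULL,
                           &pbs_job_id, &msg);
    #else
      /* Fallback actually in use: the classic IFL call quoted earlier. */
      pbs_job_id = pbs_submit(c->pbs_conn, sc->pbs_attribs,
                              sc->script_filename, NULL, NULL);
    #endif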

tatarsky commented 9 years ago

strings libdrmaa.so|grep 1.0.18
DRMAA for PBS Pro v. 1.0.18 <http://sourceforge.net/projects/pbspro-drmaa/>

vipints commented 9 years ago

Hi @tatarsky, some thoughts on this issue:

a) Recently we found that queued jobs can go over 1024 without causing the issue; it is strictly tied to our total number of jobs executed. Thanks @cganote for getting the log messages from her side!

b) When executing the "configure" step in the pbs-drmaa-1.0.18 repository, I can see the config log message says: "checking for pbs_submit_hash in -ltorque... no", which clearly indicates that pbs-drmaa is using pbs_submit to dispatch the jobs.

tatarsky commented 9 years ago

Interesting on a), and noted before on b). Somewhere in our emails we discussed why that configure check fails, and it appeared to boil down to the config test program being unable to locate pbs_submit_hash even though it's right there in the library. Felt like another C++ name-mangling problem, IIRC. Will scan emails.

vipints commented 9 years ago

In Torque 5, there is a definition for pbs_submit_hash_ext: https://github.com/adaptivecomputing/torque/blob/5.0.0/src/include/pbs_ifl.h#L665

pbs-drmaa-1.0.18 uses the pbs_ifl.h header file, which doesn't have the function declaration.

pbs_submit_hash itself is declared in https://github.com/adaptivecomputing/torque/blob/5.0.0/src/lib/Libifl/lib_ifl.h#L72, which is not imported in pbs-drmaa-1.0.18.
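
If we do end up patching pbs-drmaa to call the hash-based submit, that missing declaration would have to be supplied somehow, since the installed pbs_ifl.h does not carry it. A minimal, clearly hypothetical sketch (the void* parameters are placeholders; Torque's real declaration in lib_ifl.h uses internal attribute-container types):

    /* Hypothetical forward declaration to make pbs_submit_hash visible to
     * code that only includes pbs_ifl.h.  Wrapped in extern "C" so a C++
     * build of the caller still looks for the unmangled symbol. */
    #ifdef __cplusplus
    extern "C" {
    #endif
    int pbs_submit_hash(int socket, void *job_attr, void *res_attr,
                        char *script, char *destination, char *extend,
                        char **return_jobid, char **msg);
    #ifdef __cplusplus
    }
    #endif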

tatarsky commented 9 years ago

Yep. See my link a few above where the exact same discussion takes place.

http://www.clusterresources.com/pipermail/torqueusers/2015-March/017966.html

tatarsky commented 9 years ago

No signs of any change to the Adaptive "supported" drmaa libraries in the recent 5.1.1 release.

tatarsky commented 9 years ago

The bug where the server needs to be restarted was discussed on the Torque mailing list this weekend, and a patch was posted by the lead developer to try. I am looking at it first and will discuss further after an appointment this morning. It's a one-line change in a section involved in the pbs_submit area.

vipints commented 9 years ago

Thank you @tatarsky!

tatarsky commented 9 years ago

I'm currently sort of hoping somebody else tests it first as we have no real test environment.

vipints commented 9 years ago

@cganote do you have a test environment to check the patch for pbs_submit function?

cganote commented 9 years ago

I do! Point me at the code =)

vipints commented 9 years ago

The discussion happened here: http://www.supercluster.org/pipermail/torqueusers/2015-September/018302.html

tatarsky commented 9 years ago

The small patch appears in the archive as a detached attachment; here is the direct link to it: http://www.supercluster.org/pipermail/torqueusers/attachments/20150925/8b910935/attachment-0001.bin

tatarsky commented 9 years ago

While the patch is trivial, I am going to hold off a notch and monitor that mailing list thread in case it introduces some other issue. We are fully loaded out there on the cluster and I have to exercise care in pbs_server land.

cganote commented 9 years ago

I installed the patch last week, recompiled Torque, and pointed my DRMAA and pbs_python at the new libtorque. So far so good. I haven't had it fail with the usual message, though my production Galaxy server running DRMAA did go OOM; it seems like there must be a memory leak in DRMAA somewhere. Since the pbs issue is so intermittent, I'll let you know if I see it get into the weird state; otherwise, we'll wait. Silence is good.

-Carrie

tatarsky commented 9 years ago

OK, that's very promising and appreciated. Here is what I'm going to do, as I have the patched pbs_server binary ready: the next time @vipints requests a server restart due to the bug, I will restart with the patched version while retaining the old one as a fallback. Fair?

vipints commented 9 years ago

Thank you @cganote!

Sure, when I hit the point where max_num_jobs is reached, I will let you know.

tatarsky commented 9 years ago

Another mailing list tester, BTW, has run this patch for quite some time without issues or other side effects, which is a good sign. We are still not running it, but I am ready to do so.

vipints commented 9 years ago

I think we will wait for the next max_job_reached message from DRMAA and then apply the patch we have.

vipints commented 8 years ago

@tatarsky I have reached max_num_jobs via drmaa; shall we fix this with the patch? Thank you!

tatarsky commented 8 years ago

Given that I am going to be interrupt-driven today (I am on the road), it would be wise to wait until later in the day, when I am in one location for a while and can monitor for any issues. Is that OK with you? Otherwise I will just restart pbs_server and wait for the next occasion when I am not on the road.

vipints commented 8 years ago

Sure, I can wait for you. Thank you!

tatarsky commented 8 years ago

It boils down to being paranoid that this one-line patch causes some unexpected issue and that I am delayed in seeing it due to my efforts on this other matter.

vipints commented 8 years ago

So far I have been cheating drmaa with an internal counter (a small hack in my code), which has allowed me to keep running jobs until today (I think we did a pbs_server restart a while back). By now, I feel we should go with the one-line patch to fix the problem for the long run.

tatarsky commented 8 years ago

I have not gotten to this yet; a series of problems I did not expect hit me today. Advise if you are being delayed by it.

vipints commented 8 years ago

I will wait for this.

tatarsky commented 8 years ago

The patched pbs_server was deployed at 8:06 AM on 10/29.

tatarsky commented 8 years ago

And to be clear, we restarted with that version at that time. The old version is saved off.

vipints commented 8 years ago

Thank you @tatarsky! An example test run was successful. I suggest keeping the ticket open for a few more days to watch the performance.

tatarsky commented 8 years ago

Well, I'd like to keep the ticket open until we feel we have pushed more jobs through pbs_submit than the old "limit" and you don't get blocked from submitting more, thanks to the patch ;) But we're on the same page.

tatarsky commented 8 years ago

Just curious if you have a feel for how many DRMAA jobs you've pushed through so far ;) I'm going to be fascinated if, after all this, it was literally a one-line patch.

vipints commented 8 years ago

I was not in the office today and didn't have access to email... From yesterday morning (Oct 29, 9am) until Fri Oct 30 12:02:11 2015, I submitted and collected results for 1454 jobs successfully.