vipints opened this issue 9 years ago
I have some additional efforts I'll put up here shortly. Torque 5 does appear to support pbs_submit_hash, but when the configure script runs its conftest to detect it, the test fails, for reasons that smell suspiciously like another name-mangling item similar to previous battles.
I am attempting to convince the 3rd-party Python library that the call DOES exist, but I have been busy with the GPU matter.
Sure, I was searching through the Torque codebase to understand the job-submission handling. torque-5.0.1-1_4fa836f5/src/cmds/qsub_functions.c uses pbs_submit_hash to submit the jobs, while the DRMAA part uses pbs_submit to manage the jobs. I remember we compiled the shared object libdrmaa.so using the Torque 5 dist.; just recalling the events.
Not if the DRMAA code's configure detects pbs_submit_hash.... which, for odd reasons, it does not.
The define it sets is:
HAVE_PBS_SUBMIT_HASH
I have isolated its conftest and it indeed fails; see hashtest.c in my homedir.
Note possible involvement of yet another extern "C" issue....
Oh, see more what you are saying BTW from the libdrmaa compiles:
submit.c: pbs_job_id = pbs_submit(c->pbs_conn, sc->pbs_attribs, sc->script_filename, NULL, NULL);
And I can confirm qsub definitely uses pbs_submit_hash, via gdb on a foo.sh submission with a breakpoint on that function. So it's in Torque 5... but I suspect libdrmaa isn't coded to use it. Assuming, BTW, that this is the actual issue....
Breakpoint 1, pbs_submit_hash (socket=1, job_attr=0x628170, res_attr=0x628380, script=0x7fffffffba80 "/tmp/qsub.fjSwH1", destination=0x630fd0 "", extend=0x0,
return_jobid=0x616ce8, msg=0x7fffffffdc10) at ../Libifl/pbsD_submit_hash.c:109
109 ../Libifl/pbsD_submit_hash.c: No such file or directory.
in ../Libifl/pbsD_submit_hash.c
Another person has noted the same issue; see their notes.
Exactly, I believe so, looking through submit.c.
So I think we need a "new" submit.c that uses the hash function.
And possibly some wrapper changes in the python.....
More recent torque mailing list discussion from same person:
http://www.clusterresources.com/pipermail/torqueusers/2015-March/017966.html
OK. Pausing for a moment here.
I see two paths we can pursue: take it up with Adaptive, or fix it ourselves later via coding.
Alright, a good plan on this issue now. I am looking at point 2): a way to update submit.c.
I may need a pbs_server restart to finish a few jobs; I think I crossed the N-job limit.
Yeah give me a bit to see if I can gain any additional insight into things before I restart. Trying to get a solid explanation of this to Adaptive.
Sure, not a problem. I am trying to rewrite a few lines in submit.c.
Nothing new learned. Restarted it.
We need to chat. You are trying to fix the library we didn't use.... because it was a big mess of C++ vs. C. What's a good number to call? I don't want you wasting time fixing what I am quite certain isn't used. Email is fine.
The libdrmaa.so we ended up getting to work is from these guys.
http://sourceforge.net/projects/pbspro-drmaa/files/pbs-drmaa/1.0/ And that's what's in /opt/torque/lib/libdrmaa.so. We are running the latest version, 1.0.18.
So whatever call it uses gets wrapped by the drmaa-python code.
And it's got a complicated set of ifdefs to TRY to support pbs_submit_hash, but it's not working; it's using pbs_submit.
So if we hack on something, it's that library, not the one that came with Torque, which BTW is an ANCIENT version of the above.
strings libdrmaa.so | grep 1.0.18
DRMAA for PBS Pro v. 1.0.18 <http://sourceforge.net/projects/pbspro-drmaa/>
Hi @tatarsky, some thoughts on this issue:
a) recently we found that queued jobs can go over 1024 without causing the issue, but it is strictly tied to our total number of jobs executed. Thanks @cganote for getting the log messages from her side!
b) when running the "configure" step in the pbs-drmaa-1.0.18 repository, I can see the config log says:
checking for pbs_submit_hash in -ltorque... no
which clearly indicates that pbs-drmaa is using pbs_submit to dispatch the jobs.
Interesting on a), and noted before on b). Somewhere in our emails we discuss why that configure fails, and it appeared to boil down to the config test program being unable to locate pbs_submit_hash even though it's right there in the library. Felt like another C++ name-mangling problem, IIRC. Will scan emails.
In Torque 5, there is a definition for pbs_submit_hash_ext:
https://github.com/adaptivecomputing/torque/blob/5.0.0/src/include/pbs_ifl.h#L665
pbs-drmaa-1.0.18 uses the pbs_ifl.h header file, which doesn't declare the function. pbs_submit_hash is declared in:
https://github.com/adaptivecomputing/torque/blob/5.0.0/src/lib/Libifl/lib_ifl.h#L72
which pbs-drmaa-1.0.18 does not include.
Yep. See my link a few above where the exact same discussion takes place.
http://www.clusterresources.com/pipermail/torqueusers/2015-March/017966.html
No signs of any change to the Adaptive-"supported" DRMAA libraries in the recent 5.1.1 release.
The bug where the server needs to be restarted was discussed on the Torque mailing list this weekend, and a patch was posted by the lead developer to try. I am looking at it first and will discuss further after an appointment this morning. It's a one-line change in a section involved in the pbs_submit area.
Thank you @tatarsky!
I'm currently sort of hoping somebody else tests it first as we have no real test environment.
@cganote do you have a test environment to check the patch for pbs_submit function?
I do! Point me at the code =)
The discussions happened here: http://www.supercluster.org/pipermail/torqueusers/2015-September/018302.html
The small patch appears as a removed attachment and here is the link to it. http://www.supercluster.org/pipermail/torqueusers/attachments/20150925/8b910935/attachment-0001.bin
While the patch is trivial, I am going to hold off a bit and monitor that mailing-list thread in case it introduces some other issue. We are fully loaded out there on the cluster and I have to exercise care in pbs_server land.
I installed the patch last week, recompiled Torque, and pointed my DRMAA and pbs_python at the new libtorque. So far so good. I haven't had it fail with the usual message, though my production Galaxy server running DRMAA did go OOM; it seems like there must be a memory leak in DRMAA somewhere. Since the pbs issue is so intermittent, I'll let you know if I see it get into the weird state; otherwise, we'll wait. Silence is good.
-Carrie
OK, that's very promising and appreciated. Here is what I'm going to do, as I have the pbs_server binary ready: the next time @vipints requests a server restart due to the bug, I will restart with the patched version while retaining the old one as fallback. Fair?
Thank you @cganote!
Sure, when I hit the point where max_num_jobs is reached, I will let you know.
Another mailing list tester BTW has run this patch for quite some time without issues or other side effects. Which is a good sign. We remain not running it but I am ready to do so.
I think we will wait for the next max_job_reached message from DRMAA and then fix it with the patch we have.
@tatarsky I have reached max_num_jobs via drmaa; shall we fix this with the patch? Thank you!
It would be wise, given I am going to be interrupt-driven today (I am on the road), to wait until later in the day when I am in one location for a while, so I can monitor for any issues. Is that OK with you? Otherwise I will just restart pbs_server and wait for the next time, when I am not on the road.
Sure, I can wait for you. Thank you!
It boils down to being paranoid that this one-line patch causes some unexpected issue and that I am delayed in seeing it due to my efforts on this other matter.
So far I have been cheating drmaa with an internal counter (a small hack in my code), which has allowed me to run jobs until today (I think we did a pbs_server restart a while back). By now, I feel we should go with the one-line patch to fix the problem for the long run.
I have not gotten to this yet; a series of problems I did not expect hit me today. Let me know if you are being delayed by it.
I will wait for this,
The patched pbs_server was deployed at 8:06 AM on 10/29.
And to be clear, restarted with that version at that time. Old version saved off.
Thank you @tatarsky! An example test run was successful. I suggest keeping the ticket open for a few more days to watch the performance.
Well, I'd like to keep the ticket open until we feel we've pbs_submit'ed more than the old "limit" and, thanks to the patch, you don't get blocked from submitting more ;) But we're on the same page.
Just curious if you have a feel for how many DRMAA jobs you've pushed through so far ;) I'm going to be fascinated if after all this it was literally a one line patch.
Was not in office today and didn't have access to email... from yesterday morning (Oct 29, 9am) until Fri Oct 30 12:02:11 2015, I have submitted and collected results for 1454 jobs successfully.
Since yesterday, I am struggling to debug an error caused by running Python scripts via the drmaa module. My compute jobs are failing with Exit_status=127; here is one such event: tracejob -slm -n 2 3025874. The drmaa module is able to dispatch the job to the worker node with all necessary PATH variables, but it fails just after that (using only a single second). The log file didn't give much information:
-bash: module: line 1: syntax error: unexpected end of file
-bash: error importing function definition for `BASH_FUNC_module'
-bash: line 1: /var/spool/torque/mom_priv/jobs/3025874.mskcc-fe1.local.SC: No such file or directory
I am able to run this python script without Torque on the login machine and a worker node (with qlogin).
Has anybody used the drmaa + Python combination in cluster computing?
I checked the drmaa job environment; all env PATH variables are loaded correctly. I am not sure why the worker node is kicking out my job.
I am not quite sure how to proceed with the debugging or where to look; any suggestions/help? :)