vipints opened this issue 9 years ago
That bash error looks suspiciously like what the bash patch for "shellshock" says when an improper function invocation is attempted.
BTW are you saying "this worked before yesterday" ???
Yes, these scripts were running perfectly until yesterday.
Did you perhaps "add a module" yesterday? As far as I can tell, that error has to do with the "modules" package.
No
Your .bashrc was modified as of this morning... what was changed?
Also attempting to reproduce....
Just deleted an empty line, that is what I remember...
Well I'll have to look around. No changes I can think of on the cluster except the epilog script which only fires if the queue is "active".
I didn't make any changes to the scripts. thank you @tatarsky.
Do you know BTW what "the worker node" was in that message? Can dig around but if you already know it would be appreciated.
gpu-1-13
I asked for nodes=1 and ppn=4, and it dispatched:
exec_host=gpu-1-13/6+gpu-1-17/19+gpu-1-15/10+gpu-3-8/14
Yeah, I saw that. Does your script contain an attempt to use "module"? Or perhaps provide me the location of the item you run? That error is coming from the modules /etc/profile.d/modules.sh as far as I can tell, which is untouched, so I'm curious what's calling it.
sending an email with details.
Thanks!
Made title of this more specific for my tracking purposes.
Under some condition the method DRMAA python uses to submit jobs appears to get blocked from submitting more data. I have @vipints running again but am chasing down what the resolution was for this Torque mailing list discussion.
http://www.supercluster.org/pipermail/torqueusers/2014-January/016732.html
I do not believe the hotfix "introduced" this problem as the date of this is old. Opening ticket with Adaptive to enquire.
Hi @tatarsky, this morning I noticed that python drmaa submitted jobs are not dispatching to the worker nodes. I am not able to see the start time; for example: showstart 3122268
INFO: cannot determine start time for job 3122268
Don't know what is happening here.
I don't see the same issue as before.
Looks to me like a simple case of your jobs being rejected due to resources.
```
checkjob -v 3122268

Node Availability for Partition MSKCC --------
gpu-3-9 rejected: Features
gpu-1-4 rejected: Features
gpu-1-5 rejected: Features
gpu-1-6 rejected: Features
gpu-1-7 rejected: HostList
gpu-1-8 rejected: HostList
gpu-1-9 rejected: HostList
gpu-1-10 rejected: HostList
gpu-1-11 rejected: HostList
gpu-1-12 rejected: Features
gpu-1-13 rejected: Features
gpu-1-14 rejected: Features
gpu-1-15 rejected: Features
gpu-1-16 rejected: Features
gpu-1-17 rejected: Features
gpu-2-4 rejected: HostList
gpu-2-5 rejected: HostList
gpu-2-6 rejected: Features
gpu-2-7 rejected: HostList
gpu-2-8 rejected: Features
gpu-2-9 rejected: Features
gpu-2-10 rejected: HostList
gpu-2-11 rejected: Features
gpu-2-12 rejected: Features
gpu-2-13 rejected: HostList
gpu-2-14 rejected: Features
gpu-2-15 rejected: Features
gpu-2-16 rejected: Features
gpu-2-17 rejected: Features
gpu-3-8 rejected: Features
cpu-6-1 rejected: Features
cpu-6-2 rejected: HostList
NOTE: job req cannot run in partition MSKCC (available procs do not meet requirements : 0 of 1 procs found)
idle procs: 608 feasible procs: 0
Node Rejection Summary: [Features: 21][HostList: 11]
```
Thanks @tatarsky, I saw this message but forgot to include it in the previous one. Not sure why it got rejected, as I am only requesting limited resources: 12gb mem and 40hrs cput_time.
This is a little weird, perhaps a syntax error?
Features: cpu-6-2
So it seems to be asking for a feature of a hostname....
It's weird: if you look at the "required hostlist", cpu-6-2 does not appear in it, yet I see you requesting it.
```
Opsys: --- Arch: --- Features: cpu-6-2
Required HostList: [gpu-1-12:1][gpu-1-13:1][gpu-1-16:1][gpu-1-17:1][gpu-1-14:1][gpu-1-15:1]
  [cpu-6-1:1][gpu-3-8:1][gpu-3-9:1][gpu-1-4:1][gpu-1-5:1][gpu-1-6:1]
  [gpu-2-17:1][gpu-2-16:1][gpu-2-15:1][gpu-2-14:1][gpu-2-12:1][gpu-2-11:1]
  [gpu-2-6:1][gpu-2-9:1][gpu-2-8:1]
```
From the queue file...
```
<submit_args flags="1"> -N pj_41d1c2f4-e8c0-11e4-97d2-5fd54d3e274e -l mem=12gb -l vmem=12gb -l pmem=12gb -l pvmem=12gb
 -l nodes=1:ppn=1 -l walltime=40:00:00 -l host=gpu-1-12+gpu-1-13+gpu-1-16+gpu-1-17+gpu-1-14+gpu-1-15+cpu-6-2+cpu-6-1+gpu-3-8
 +gpu-3-9+gpu-1-4+gpu-1-5+gpu-1-6+gpu-2-17+gpu-2-16+gpu-2-15+gpu-2-14+gpu-2-12+gpu-2-11+gpu-2-6+gpu-2-9+gpu-2-8</submit_args>
```
Yes, correct: I am requesting specific hostnames in my submission arguments. Due to the OOM issue I have blacklisted the following nodes: ['gpu-1-10', 'gpu-1-9', 'gpu-1-8', 'gpu-1-11', 'gpu-1-7', 'gpu-2-5', 'gpu-2-13', 'gpu-2-7', 'gpu-2-4', 'gpu-2-10']
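For context, a minimal sketch of how a blacklist like this typically ends up as a `-l host=` stanza in the drmaa native specification (the node pool, script path, and variable names are illustrative, not the actual submission code):

```python
import drmaa

# Hypothetical blacklist and node pool -- illustrative only, not the real submission code.
BLACKLIST = {'gpu-1-10', 'gpu-1-9', 'gpu-1-8', 'gpu-1-11', 'gpu-1-7',
             'gpu-2-5', 'gpu-2-13', 'gpu-2-7', 'gpu-2-4', 'gpu-2-10'}
ALL_NODES = ['gpu-1-12', 'gpu-1-13', 'gpu-1-16', 'gpu-1-17', 'cpu-6-1', 'cpu-6-2',
             'gpu-1-10', 'gpu-1-9']  # truncated pool for illustration

allowed = [n for n in ALL_NODES if n not in BLACKLIST]

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = '/path/to/script.sh'  # placeholder
    # The '-l host=' stanza built here is what checkjob later reported as a
    # "Features: cpu-6-2" request instead of a plain host list.
    jt.nativeSpecification = (
        '-l mem=12gb -l vmem=12gb -l pmem=12gb -l pvmem=12gb '
        '-l nodes=1:ppn=1 -l walltime=40:00:00 '
        '-l host=' + '+'.join(allowed)
    )
    print('submitted', s.runJob(jt))
    s.deleteJobTemplate(jt)
```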
Try the submit without the blacklist. The OOM issue is not node related. I continue to work on the best solution to it.
Sure trying now. Thank you!
To be clear, it's to try to figure out why what looks to me like a valid "host=" stanza got interpreted as a feature request...
OK, when drmaa includes the argument -l host in the job submission, the job stays in the queue. I removed the host argument from the submission; with that, the jobs are submitted and finish successfully.
So I suspect it's parsing it wrong somewhere, but I have no clue where. Noted for the next dive into the code. Working to remove the reasons you are doing the hostlist in #241.
thanks! I am just clearing the queued jobs.
checking the native specification of drmaa
I forgot how we were doing on this. No obvious config issues have been unearthed yet.
Hi @tatarsky, after restarting the pbs_server there were no issues submitting jobs via python drmaa. I tried submitting a large number of jobs and they were all successful.
A few days back I got an error message: drmaa.errors.InternalException: code 1: (qsub) cannot access script file: but after 10 minutes, when I tried again, it was working smoothly. Not many details are associated with this error.
OK. I'm leaving this open for a bit in case I am able to reproduce whatever happened.
Hi @tatarsky, I think DRMAA python reached the submission limit on the # of jobs via the drmaa qsub wrapper. I am not able to submit more jobs to the worker nodes. I think the pbs_server restart helped resolve this problem last time. Can you help me here?
thank you!
Reproduced with my simple test. Restarted. Collected logs to add to ticket.
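For the record, a reproduction of this kind generally boils down to burst-submitting trivial jobs through drmaa until pbs_server stops accepting them; a minimal sketch, not the exact test used here (the job count, command, and resource string are assumptions):

```python
import drmaa

# Burst-submit trivial jobs; the count, command, and resource string are
# assumptions chosen only to exercise the drmaa -> pbs_server submission path.
N_JOBS = 500

with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = '/bin/sleep'
    jt.args = ['10']
    jt.nativeSpecification = '-l nodes=1:ppn=1 -l walltime=00:05:00'
    submitted = []
    try:
        for _ in range(N_JOBS):
            submitted.append(s.runJob(jt))
    except drmaa.errors.DrmaaException as exc:
        # This is roughly where the "blocked from submitting" behaviour shows up:
        # submissions start failing until pbs_server is restarted.
        print('submission failed after %d jobs: %s' % (len(submitted), exc))
    finally:
        s.deleteJobTemplate(jt)
```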
Thank you very much @tatarsky !
Just my 2 cents - I struggle with this issue as well, at least it sounds similar. We have to restart pbs_server every day or so because we get too many jobs submitted and then pbs doesn't respond. On restart, I get a bunch of messages that the jobs were in a strange substate. Let me know if I can help. This also happens using pbs_python, though the error messages are different.
Yeah I actually saw your discussion on this I believe. When I return to looking at this I will let you know if I get anywhere.
Hi @tatarsky, I may need your help to restart the pbs_server, as I reached the limit (# of job submissions via python drmaa) that is imposed by the qsub drmaa wrapper.
I've restarted it. I've not had time to delve into this.
thanks @tatarsky.
Hi @tatarsky, since late evening today I have been getting a "job exit status 127 handled" error message for my python drmaa submitted jobs. Job IDs are 3387611 - 3387680; they dispatch to the compute node and stall the next second. I double-checked the paths used by the program and they all look good. Any idea what's going wrong here? Thanks.
Restarted pbs_server. See if that's the issue again. But this isn't something I'm going to support like this going forward. It's obviously got problems.
On Monday I'd like a call to discuss how this software will be handled if we cannot find a fix for this. And how many hours we want to devote to finding that fix.
Thanks @tatarsky, I just restarted my jobs and they are running now.
I agree with your statement and I am going to look at the codebase tomorrow, talk to you then.
Today has had many issues and I have to cycle on a few more still. Perhaps sometime tomorrow we could review the plan of attack for this python.
Possible clue from config.log of our pbs_drmaa:
```
configure:16224: checking for pbs_submit_hash in -ltorque
/home/paul/pbs-drmaa-1.0.18/conftest.c:34: undefined reference to `pbs_submit_hash'
```
The discussion I referenced a few comments back seems to state this all has something to do with that function.
Another clue:
http://www.supercluster.org/pipermail/torqueusers/2013-November/016514.html
That's the author stating that the version before the one we used (I believe) has this handled properly, but ifdef'd. I wonder if it's a "4" vs. "5" sort of thing.
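One quick way to check whether the installed libtorque actually exports pbs_submit_hash is a ctypes symbol lookup; a minimal sketch (the library name is an assumption and may differ per install):

```python
import ctypes
import ctypes.util

# The library name/soname is an assumption; adjust for the local Torque install.
libname = ctypes.util.find_library('torque') or 'libtorque.so.2'
libtorque = ctypes.CDLL(libname)

try:
    libtorque.pbs_submit_hash  # attribute access triggers the symbol lookup
    print('pbs_submit_hash is exported (newer, "5"-style libtorque?)')
except AttributeError:
    print('pbs_submit_hash not found (older, "4"-style libtorque?)')
```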
Hi @tatarsky, I believe this is indeed a "4" vs "5" issue.
Since yesterday, I have been struggling to debug an error caused by running python scripts via the drmaa module. My compute jobs are failing with Exit_status=127; here is one such event: tracejob -slm -n 2 3025874. The drmaa module is able to dispatch the job to the worker node with all the necessary PATH variables, but the job fails just after that (using only a single second). The log file didn't give much information:

```
-bash: module: line 1: syntax error: unexpected end of file
-bash: error importing function definition for BASH_FUNC_module'
-bash: line 1: /var/spool/torque/mom_priv/jobs/3025874.mskcc-fe1.local.SC: No such file or directory
```
I am able to run this python script without Torque on the login machine and a worker node (with qlogin).
Has anybody used the drmaa + python combination in cluster computing?
I checked the drmaa job environment; all env PATH variables are loaded correctly. I am not sure why the worker node is kicking out my job.
I am not quite sure how to proceed with the debugging or where to look; any suggestions/help? :)
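For what it's worth, one way to double-check the environment a drmaa-submitted job actually sees is to run a throwaway job that dumps it to shared storage; a minimal sketch with placeholder paths:

```python
import drmaa

# Throwaway job that dumps its environment to shared storage for inspection.
# The output path is a placeholder; point it at a filesystem the worker nodes see.
with drmaa.Session() as s:
    jt = s.createJobTemplate()
    jt.remoteCommand = '/bin/sh'
    jt.args = ['-c', 'env | sort > /path/to/shared/drmaa_env.$PBS_JOBID']
    jt.nativeSpecification = '-l nodes=1:ppn=1 -l walltime=00:05:00'
    job_id = s.runJob(jt)
    s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)  # block until it finishes
    s.deleteJobTemplate(jt)
```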