Closed steven-albanese closed 9 years ago
I seem to be having trouble requesting interactive sessions. I get stuck waiting for the job to start. The following is what I'm requesting:
qsub -I -q active -l walltime=04:00:00 -l nodes=1:ppn=1:gpus=1:shared -l mem=4gb
Actually, I show there is very little memory out there right now to meet that memory request. mdiag -n can show you what remains, and the only node with free memory is one we are waiting on a part for. I suspect you could get a login with 2G.
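For reference, the check I mean is simply the following; the memory columns show available vs. configured per node (exact layout varies by Moab version):
mdiag -n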
It looks like this problem came up again. I'm only requesting 1G of memory this time:
qsub -I -q active -l walltime=04:00:00 -l nodes=1:ppn=1:gpus=1:shared -l mem=1G
Got a job ID, so I can look at the details?
5799291.mskcc-fe1.local
mdiag -n again shows a very loaded cluster.
Nodes with memory have all 32 slots consumed (and are marked Busy). Nodes with slots left appear to be very sparse on RAM... still comparing the lists.
I don't show resources to run your job right now.
In fact the cluster at the moment is as close to 100% utilized as it gets.
Thanks for taking a look at this! I'll remember to check next time.
I'm having a couple of side conversations. Things are packed pretty tight out there, though. It becomes a bit of a Tetris game.
One trick I use, BTW, is submitting the same resource arguments as a batch (non-interactive) request. It will return a slightly more useful line like:
qsub: submit error (Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty or all systems are busy))
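For example, pairing those same flags with a throwaway batch submission (the sleep body here is just a placeholder so the job has something to run):
echo 'sleep 60' | qsub -q batch -l walltime=04:00:00 -l nodes=1:ppn=1:gpus=1:shared -l mem=4gb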
Not sure that is solely due to resources. Another interactive job that is currently not scheduled is 5799368. Interestingly, it shows up in the qstat list but not the showq list:
qstat -u <user> | grep 5799368
5799368.mskcc-fe1.loca <user> active STDIN -- -- -- 1gb 02:00:00 Q --
showq -u <user> | grep 5799368
<no output>
Let me know if you need anything else.
I'll look when I have a moment. I don't show much load on the head node, but yes, that seems odd.
Torque and Moab may require a restart. I'm not clear what's going on, but it's not looking overly happy.
OK. So this may be it, and I'm not clear on the origin of this Moab config line:
MAXJOB 20000
I believe that tells Moab to only deal with 20000 jobs total from Torque.
I noticed this when showq always shows:
Total jobs: 19999
So I think this is a configuration matter where that value was chosen to prevent overloading, but it may have this confusing side effect.
I am reviewing the docs, but this value pre-dates me.
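The running value can be double-checked with the Moab client (assuming it's in your path):
showconfig | grep MAXJOB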
Well, from the horse's mouth:
MAXJOB
Specifies the maximum quantity of jobs for which Moab should allocate memory used for tracking jobs. If Moab is tracking the maximum quantity of jobs specified by this parameter, it rejects subsequent jobs submitted by any user since it has no memory left with which to track newly submitted jobs.
So it's doing what we've told it to do. The question is whether we wish to tell it to do something different.
@KjongLehmann appears to represent 17K of that current job load.
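For anyone curious, that per-user breakdown came from eyeballing something like the following (NR>2 skips the qstat header lines; the user is the third column in the default output, though columns can vary by Torque version):
qstat | awk 'NR>2 {print $3}' | sort | uniq -c | sort -rn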
I think this setting was made here: https://github.com/cBio/cbio-cluster/issues/85
One quote from that thread: "To solve the issue of unlimited jobs we set moab on our last call with AC to only look at the first 20k jobs at a time. And yes, each array job is counted as a job since it has to be scheduled as a job."
Yes, I was unaware of that total job limit. All of them are simple low-priority jobs which I thought could trickle through. Anyway, I'm in the process of deleting jobs.
Well, I might just try bumping it to confirm. Hold on a sec.
Could we just lift the limit to 50K? @tatarsky you mentioned that there is not too much load on the scheduler node.
Sorry, cross-posting.
Sorry, already started deleting, but I can fill it up again to confirm?
I'm going to bump it by a smaller increment than doubling it.
Bumping to 25K for test purposes. We need to discuss this in context of another conversation of getting a larger head node.
filling up again.
Was able to add 20K plus
Yes.
So it's definitely related to this issue, and we need to come to some sort of agreement about how to handle it.
It appears that when Moab's MAXJOB limit is hit, Torque jobs submitted after that point live in a bit of a limbo state. I am checking to see if there is a Torque config item to stop accepting further jobs at that point; my guess at it is sketched below.
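If there is one, I suspect it's the per-queue cap, something along these lines (not yet confirmed as the right lever):
qmgr -c 'set queue batch max_queuable = 20000'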
I can clearly see now that the qstat/showq output contains my active-queue request job, because Moab still has some headroom.
I do not want to raise the Moab maximum further without additional study, but I will add a Ganglia graph of it and an alert.
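Rough sketch of the Ganglia feed I have in mind, run from cron (the metric name is my own invention):
JOBS=$(showq | awk '/Total jobs/ {print $3}')
gmetric --name moab_total_jobs --value "$JOBS" --type uint32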
I can continue filling to 25K and see whether we re-encounter the same problem?
No, I'm pretty sure it's simply doing what it was told to do. It handles job slots up to MAXJOB.
K, will reduce load again.
I am not clear on why you are able to exceed the Torque setting:
set queue batch max_user_queuable = 5000
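For reference, the rest of that queue's settings can be dumped with:
qmgr -c 'print queue batch'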
They are low-priority jobs; the hope was that they'd slowly trickle through. I can take over some of the maintenance, though.
I think we need to cap that queue as well.
Capping lowpriority to 5K for now. Will think about it more.
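Presumably via the usual qmgr route, something like the following (or max_user_queuable, to match the batch queue's setting):
qmgr -c 'set queue lowpriority max_queuable = 5000'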
End result of the investigation for now: we've upped MAXJOB to 25K and capped lowpriority. @steven-albanese, I believe your issues were a combo of the high queue count and high resource requests. I'm monitoring and adding some alerts, and asking for an update on the larger head node Git request over in the admin area.
Continuing to monitor this limit and its impact here. Leaving this open while I do so.
I am of the opinion this was definitely Moab MAXJOB-related, and I now monitor for that (and alert) more closely. If you see an instance of this again, we will start with checking there as well as the resources. I am closing for now, but feel free to re-open.