cBio / cbio-cluster

MSKCC cBio cluster documentation

trouble requesting interactive sessions #310

Closed: steven-albanese closed this issue 9 years ago

steven-albanese commented 9 years ago

I seem to be having trouble requesting interactive sessions. I get stuck waiting for the job to start.

The following is what I'm requesting:

qsub -I -q active -l walltime=04:00:00 -l nodes=1:ppn=1:gpus=1:shared -l mem=4G
vipints commented 9 years ago

qsub -I -q active -l walltime=04:00:00 -l nodes=1:ppn=1:gpus=1:shared -l mem=4gb

tatarsky commented 9 years ago

Actually, I see there is very little memory out there right now to meet that memory request. mdiag -n can show you what remains; the only node with free memory is one we are waiting on a part for. I suspect you could get a login with 2G.
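For example, a rough sketch of the check plus a lowered request (mem=2gb here is just what looks feasible at the moment):

mdiag -n    # eyeball the per-node available memory before picking a size
qsub -I -q active -l walltime=04:00:00 -l nodes=1:ppn=1:gpus=1:shared -l mem=2gb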

steven-albanese commented 9 years ago

It looks like this problem came up again. I'm only requesting 1G of memory this time:

qsub -I -q active -l walltime=04:00:00 -l nodes=1:ppn=1:gpus=1:shared -l mem=1G
tatarsky commented 9 years ago

Got a job ID so I can look at the details?

steven-albanese commented 9 years ago

5799291.mskcc-fe1.local

tatarsky commented 9 years ago

mdiag -n again shows a very loaded cluster.

tatarsky commented 9 years ago

Nodes with free memory have all 32 slots consumed (and are marked Busy). Nodes with slots left appear to be very sparse on RAM... still comparing the lists.

tatarsky commented 9 years ago

I don't show resources to run your job right now.

tatarsky commented 9 years ago

In fact the cluster at the moment is as close to 100% utilized as it gets.

steven-albanese commented 9 years ago

Thanks for taking a look at this! I'll remember to check next time.

tatarsky commented 9 years ago

I'm having a couple of side conversations. Things are packed pretty tight out there, though; it becomes a bit of a Tetris game.

tatarsky commented 9 years ago

One trick I use, BTW, is to submit the same resource arguments as a batch queue request. It will return a slightly more useful line, like:

qsub: submit error (Job exceeds queue resource limits MSG=cannot locate feasible nodes (nodes file is empty or all systems are busy))
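A message like that comes back from a submission along these lines (a sketch; the echoed sleep payload is just a placeholder job body):

# same resource list, but non-interactive, so qsub reports the scheduler's
# rejection reason at submit time instead of leaving you waiting
echo "sleep 60" | qsub -q batch -l walltime=04:00:00 -l nodes=1:ppn=1:gpus=1:shared -l mem=1gb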
akahles commented 9 years ago

Not sure that is solely due to resources. Another interactive job that is currently not scheduled is 5799368. Interestingly, it shows up in the qstat list but not the showq list:

qstat -u <user> | grep 5799368
5799368.mskcc-fe1.loca  <user>      active   STDIN               --    --     --     1gb  02:00:00 Q       -- 
showq -u <user> | grep 5799368
<no output>

Let me know if you need anything else, but this should be evident from the job ID. Could it be that there are too many jobs in the system, slowing the queue down?

tatarsky commented 9 years ago

I'll look when I have a moment. I don't show much load on the head node but yes that seems odd.

tatarsky commented 9 years ago

Torque and Moab may require a restart. I'm not clear what's going on, but it's not looking overly happy.

tatarsky commented 9 years ago

OK. So this may be it, and I'm not clear on the origin of this Moab line:

MAXJOB 20000

I believe that tells Moab to only deal with 20000 jobs total from Torque.

I noticed this because showq always shows:

Total jobs: 19999

So I think this is a configuration matter: that value was chosen to prevent overloading, but it may have this confusing side effect.

I am reviewing the docs but this value pre-dates me.
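A quick way to line the two numbers up (a sketch; the moab.cfg path is a guess at this install's layout):

grep -i '^MAXJOB' /opt/moab/etc/moab.cfg   # hypothetical config path holding the cap
showq | grep 'Total jobs'                  # current count to compare against it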

tatarsky commented 9 years ago

Well, from the horse's mouth:

MAXJOB

Specifies the maximum quantity of jobs for which Moab should allocate memory used for tracking jobs. If Moab is tracking the maximum quantity of jobs specified by this parameter, it rejects subsequent jobs submitted by any user since it has no memory left with which to track newly submitted jobs.

So it's doing what we've told it to do. The question is whether we wish to tell it to do something different.

tatarsky commented 9 years ago

@KjongLehmann appears to represent 17K of that current job load.

akahles commented 9 years ago

I think this setting was made here: https://github.com/cBio/cbio-cluster/issues/85

One quote from that thread: "To solve the issue of unlimited jobs we set moab on our last call with AC to only look at the first 20k jobs at a time. And yes, each array job is counted as a job since it has to be scheduled as a job."

KjongLehmann commented 9 years ago

Yes, I was unaware of that total job limit. All of the jobs are simple low-priority jobs which I thought could trickle through. Anyway, I'm in the process of deleting jobs.

tatarsky commented 9 years ago

Well, I might just try bumping it to confirm. Hold on a sec.

akahles commented 9 years ago

Could we just lift the limit to 50K? @tatarsky you mentioned that there is not too much load on the scheduler node.

akahles commented 9 years ago

Sorry, cross posting.

KjongLehmann commented 9 years ago

Sorry, already started deleting, but I can fill it up again to confirm?

tatarsky commented 9 years ago

I'm going to bump it, but by a smaller increment than doubling it.

tatarsky commented 9 years ago

Bumping to 25K for test purposes. We need to discuss this in the context of another conversation about getting a larger head node.
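Roughly what that change looks like (a sketch only; the config path and the use of mschedctl -R to re-read the config are assumptions about this install):

sed -i 's/^MAXJOB .*/MAXJOB 25000/' /opt/moab/etc/moab.cfg   # hypothetical path
mschedctl -R                                                 # recycle Moab so it picks up the new limit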

KjongLehmann commented 9 years ago

filling up again.

KjongLehmann commented 9 years ago

I was able to add 20K-plus.

tatarsky commented 9 years ago

Yes.

So it's definitely related to this issue, and we need to come to some sort of agreement about how to handle it.

It appears that when Moab's MAXJOB limit is hit, Torque jobs submitted after that point live in a bit of a limbo state. I am checking to see if there is a Torque config item to stop accepting further jobs at that point.

I can now clearly see that the qstat/showq output contains my active-queue request job, because Moab still has some headroom.

I do not want to raise the Moab maximum further without additional study, but I will add a Ganglia graph of it and an alert.
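The kind of hook meant here would be something like (a sketch; the metric name is made up, and it assumes gmetric is installed on the head node and this runs from cron):

JOBS=$(showq | awk '/Total jobs:/ {print $3}')              # current Moab job count
gmetric --name=moab_total_jobs --value="$JOBS" --type=uint32 --units=jobs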

KjongLehmann commented 9 years ago

I can continue filling to 25K and see whether we re-encounter the same problem?

tatarsky commented 9 years ago

No, I'm pretty sure it's simply doing what it was told to do. It handles job slots up to MAXJOB.

KjongLehmann commented 9 years ago

K, will reduce load again.

tatarsky commented 9 years ago

I am not clear on why you are able to exceed the Torque setting:

set queue batch max_user_queuable = 5000
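(That limit can be checked directly from qmgr; a sketch, assuming the stock Torque client tools:)

qmgr -c 'print queue batch' | grep max_user_queuable   # should echo the 5000 line above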
KjongLehmann commented 9 years ago

They are low-priority jobs; the hope was that they would slowly trickle through. I can take over some of the maintenance, though.

tatarsky commented 9 years ago

I think we need to cap that queue as well.

tatarsky commented 9 years ago

Capping lowpriority to 5K for now. Will think about it more.
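That is, something along these lines (a sketch; it assumes the queue is literally named lowpriority and that the same max_user_queuable knob is the right one to cap):

qmgr -c 'set queue lowpriority max_user_queuable = 5000'    # mirror the batch queue cap
qmgr -c 'print queue lowpriority' | grep max_user_queuable  # confirm it took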

tatarsky commented 9 years ago

The end result of the investigation for now is that we've upped MAXJOB to 25K and capped lowpriority. I believe, @steven-albanese, your issues were a combination of a high queue count and high resource requests. I'm monitoring and adding some alerts, and asking for an update on the larger-head-node Git request over in the admin area.

tatarsky commented 9 years ago

I'll continue to monitor this limit and its impact here. Leaving this open as I do so.

tatarsky commented 9 years ago

I am of the opinion this was definitely Moab MAXJOB-related, and I now monitor for that (and alert) more closely. If you see an instance of this again, we will start by checking there as well as the resources. I am closing for now, but feel free to re-open.