cBio / cbio-cluster

MSKCC cBio cluster documentation

Queued jobs keep increasing in estimated reservation start times #413

Open · jchodera opened this issue 8 years ago

jchodera commented 8 years ago

I have a few multicore jobs queued up:

[chodera@mskcc-ln1 ~]$ qstat -u chodera

hal-sched1.local: 
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
7190964.hal-sched1.loc  chodera     batch    cluster-src         --      1     32   96gb  16:00:00 Q       -- 
7196588.hal-sched1.loc  chodera     batch    cluster-abls        --      1     16   96gb  24:00:00 Q       -- 
7196592.hal-sched1.loc  chodera     batch    cluster-abl-1140    --      1     32   96gb  01:00:00 Q       -- 
7196599.hal-sched1.loc  chodera     batch    cluster-src-1140    --      1     32   96gb  03:00:00 Q       -- 

The estimated start time (from showstart) was initially about a day; after a day it was still about a day, and now (after another day) it has jumped to NINE DAYS:

[chodera@mskcc-ln1 ~]$ showstart 7190964
job 7190964 requires 32 procs for 16:00:00

Estimated Rsv based start in              9:02:46:33 on Mon May 23 20:25:03
Estimated Rsv based completion in         9:18:46:33 on Tue May 24 12:25:03

Best Partition: MSKCC

What is going on here?
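
For reference, a rough way to document how the estimate drifts would be to poll showstart periodically and log the result. This is only a sketch (the hourly interval and log path are arbitrary; the grep pattern matches the Moab output format shown above):

#!/bin/bash
# Sketch: record the estimated reservation-based start time for a queued job
# once an hour, so drift in the estimate can be reviewed later.
JOBID=7190964
LOG="$HOME/showstart-${JOBID}.log"
while true; do
    echo "=== $(date) ===" >> "$LOG"
    showstart "$JOBID" | grep 'Estimated Rsv based' >> "$LOG"
    sleep 3600
done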

jchodera commented 8 years ago

Meanwhile, there are 2205 idle thread slots:

[chodera@mskcc-ln1 ~]$ checkjob  7190964
job 7190964

AName: cluster-src
State: Idle 
Creds:  user:chodera  group:jclab  class:batch  qos:preemptor
WallTime:   00:00:00 of 16:00:00
BecameEligible: Wed May 11 18:50:16
SubmitTime: Wed May 11 18:49:00
  (Time Queued  Total: 2:22:52:11  Eligible: 2:22:51:24)

TemplateSets:  DEFAULT
Total Requested Tasks: 32

Req[0]  TaskCount: 32  Partition: ALL
Opsys: ---  Arch: ---  Features: batch
Dedicated Resources Per Task: PROCS: 1  MEM: 3072M

SystemID:   MSKCC
SystemJID:  7190964
Notification Events: JobFail

BypassCount:    12848
Flags:          RESTARTABLE,SUSPENDABLE,PREEMPTOR
Attr:           checkpoint
StartPriority:  12469
NOTE:  job req cannot run in partition MSKCC (available procs do not meet requirements : 0 of 32 procs found)
idle procs: 2205  feasible procs:   0

Node Rejection Summary: [Features: 39][State: 5][Reserved: 34]
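
Given that the rejection summary blames Features for 39 nodes, a rough cross-check is to ask Torque which nodes actually advertise the batch property and what Moab itself reports for each node. A sketch, assuming the usual pbsnodes output where node properties appear on "properties =" lines:

# Count nodes advertising the "batch" property (Torque view):
pbsnodes -a | grep -c 'properties = .*batch'

# Per-node state, features, and available processors (Moab view, assuming
# the Moab client commands are installed on the login node):
mdiag -n
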
jchodera commented 8 years ago

So much for including those results in my talk...

tatarsky commented 8 years ago

Explain the above please.

jchodera commented 8 years ago

Here's what happened:

In the original design requirements, we selected Torque/Moab because it could provide maximum wait time estimates for jobs, precisely for occasions like this one involving hard deadlines: if a job was predicted to take too long to start, you could partition it differently to get it done in the required time (or negotiate with other groups to hold/stop some of their jobs).
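
Roughly, the intended workflow looks like this. Just a sketch; the script name and the smaller resource requests are illustrative, not what was actually submitted for the jobs above:

# Check the estimated reservation-based start time for the queued job:
showstart 7190964

# If the wait exceeds the deadline, resubmit the same work partitioned into
# smaller pieces that the scheduler can place (or backfill) sooner, e.g.
# several 8-core jobs instead of one 32-core job:
qsub -l nodes=1:ppn=8,mem=24gb,walltime=16:00:00 run-clustering.sh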

Somehow, max wait time reporting is now broken. Or something very non-obvious is going on with these jobs.

tatarsky commented 8 years ago

I refer to your abusive comment. I have filed a complaint and will not assist you further.

jchodera commented 8 years ago

I refer to your abusive comment. I have filed a complaint and will not assist you further.

I'm genuinely unclear on which part was the abusive comment. Was it the remark about not getting data in time for my talk? I'm certainly not trying to offend anyone!

jchodera commented 8 years ago

Oh! If it was the "Since I have no idea how the hell support is being handled now" comment in the mail to hpc-request, that simply reflected my confusion about the proper procedure for reporting issues. I believe there was a migration away from primary support via this issue tracker, since I was surprised to see that the hal login message had suddenly changed, without fanfare, to ask users to email hpc-request. This was certainly not a comment on response times or service quality, only on being left in the dark about support request procedures. Apologies if that caused offense; it was certainly unintended.

jchodera commented 8 years ago

To be clear: We've received no official communication regarding a change in support procedure.

tatarsky commented 8 years ago

I consider "So much for including those results in my talk..." unnecessary and abusive.

I provide support within the terms of my scope of work, none of which includes weekend support, but I do it anyway. I work very hard to resolve problems and do not deserve such statements.

Talk to Juan Perin on Monday. You'll get no further help from me.

jchodera commented 8 years ago

Clarifying email sent. Will talk to Juan when I return from the conference. Weekend support certainly not expected or needed; the deadline for getting data in time for writing the talk had already passed when this issue was filed.