frazer-lab / cluster

Repo for cluster issues.

fl-hn2 outage this evening #230

Closed tatarsky closed 6 years ago

tatarsky commented 6 years ago

While I've not had time to look in detail, the fl-hn2 system went offline and rebooted. This seemed to cause an issue on fl-hn1 as well; the cause is unclear at this time.

Will look more in morning.

Check your jobs as SGE was restarted.

Brought hub back up. Report issues here and will look in morning.

joreynajr commented 6 years ago

Hi Paul,

I ran some qalter commands yesterday trying to change the job resources on about 300-400 jobs. Afterwards I had issues using qstat, and shortly after that the cluster wasn't responding. I'm pretty sure this was the issue.
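(A hedged sketch of that kind of bulk change, for reference: job IDs come from the first column of `qstat` output after its two header lines. The `h_vmem` value and the `echo` dry-run guard are illustrative only, not the exact commands that were run.)

```shell
# Sketch only: iterate over the current user's queued jobs and re-issue
# a resource request on each. `qstat` prints two header lines before
# the data rows, so skip those with awk. The `echo` makes this a
# dry run; remove it to actually apply the qalter.
for jobid in $(qstat -u "$USER" | awk 'NR > 2 {print $1}'); do
    echo qalter -l h_vmem=2.0G "$jobid"
done
```

On a few hundred jobs this loop fires a few hundred scheduler requests in quick succession, which may be relevant to the unresponsiveness described above.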

tatarsky commented 6 years ago

Yeah, what I see appears to be a kernel panic in the NFS portion of the kernel, which is only used to service the I/O from the old nodes (and should therefore be low impact). Were your jobs running on the old nodes?

joreynajr commented 6 years ago

They were. On Friday I was attempting to switch these jobs over to the new nodes using qalter -u all.q $job, but the jobs were still not running by yesterday evening. I thought it was because I messed something up when I used qalter, so I tried to undo it using another qalter command, and that's when it happened.

tatarsky commented 6 years ago

I believe, from running an "explain" on some of your queued jobs, that the fact they still have "-l opt" defined as a hard resource will prevent them from moving queues:

hard resource_list:         opt=TRUE,h_vmem=2.0G
hard_queue_list:            all.q

So clearing that might work. Do you have a job id you want to try that on?

For my "what triggered the panic" side of the coin:

What's the level of I/O from these jobs? Remember the ENTIRE group of old nodes is constrained by a single 1Gbit line, so it needs to be very, very low I/O.

Also, right now the jobs marked "Eqw" could be restarted by issuing qmod -cj (jobid), but think about the hard resource item above first.
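(Putting those two steps together as a hedged sketch; the job ID is a hypothetical placeholder, and the resource line re-issues h_vmem without opt=TRUE per the hard resource_list shown above:)

```shell
# Hypothetical job ID for illustration.
jobid=1234567

# Re-issue the hard resource list WITHOUT opt=TRUE so the job is no
# longer pinned to the old nodes.
qalter -l h_vmem=2.0G "$jobid"

# Clear the Eqw error state so the scheduler will retry the job.
qmod -cj "$jobid"
```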

tatarsky commented 6 years ago

Sorry, and in case you don't know the command: for resources, I believe you want to re-issue the complete line of what you want. So something like:

qalter -l h_vmem=2.0G (jobid) may get it going.

And when I say "explain" if you want to see details of a queued item remember:

qstat -explain a -j (jobid)

tatarsky commented 6 years ago

And I may have to review whether that "hard_queue" concept is a problem. I usually don't move items that way (we use resources to sort queues...), so if you give me a jobid to experiment with, I can take a look as well.

joreynajr commented 6 years ago

One set of jobs in particular would have been an issue: all the ones named <sample_id>_combine_fastq. These are reading files of about 150MB to 200MB and writing out a file of roughly 1GB.

joreynajr commented 6 years ago

I attempted something like the qalter command you used before, but it didn't eliminate the opt resource. You can test using job 5029802.

joreynajr commented 6 years ago

Hey Paul,

On second thought, I think it would just be better to restart all of these jobs after checking which ones only partially completed and need to be rerun. It's a little messier with all of the dependencies, and I want to be on the safer side so I don't crash things.

tatarsky commented 6 years ago

Noted all of these. (Sorry, was getting a sandwich.) Yeah, dependencies can make it more complex to qalter.

I'll look for a more specific cause of the kernel panic but basically the I/O on the old nodes is a bit fragile.

billgreenwald commented 6 years ago

Checking in on hub:

1) Admin has been updated properly. I can access the panel.
2) The default for pdf is still not "view".

tatarsky commented 6 years ago

OK. I will follow up on number two above in #217.