ACI-REF-Virtual-Residency / questions-and-answers

A place for questions and answers

Performance Monitoring #5

Open rushgeo opened 8 years ago

rushgeo commented 8 years ago

I thought I'd try one in the "collecting knowledge to do our jobs better" category. While I could probably ask this question on Stack Exchange, it helps register an item of interest here that might make it into the wiki or other docs we write up. :rainbow:

I had a user whose code ran slower on our cluster than on his desktop. It turns out it's mostly serial code, and our cores run at a lower clock speed than his desktop's. We could chunk up his study area into 20 pieces and run all 20 simultaneously (one for each core on a node), but the catch is that one part of his model is parallel. All 20 copies reach that part at about the same time, and then fight for all 20 of the cores.

The only way I knew to verify this behavior was watching top while it was running. Is there a better way to analyze CPU usage over time?

We use Torque and Maui on CentOS 6.6.

zanewgray commented 8 years ago

We use Icinga (https://www.icinga.org) to monitor both hypervisor and VM resources in our private cloud datacenter. I would assume that such a tool could also be used to monitor and provide historical utilization data across a supercomputing environment. As with any monitoring utility, bear in mind both the performance penalty of the polling (generally a fraction of a percent) and the impact of contention (e.g. "high priority" tasks may keep the nodes from responding to monitoring polls, which leaves gaps in the performance graphs).
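For a sense of what lightweight polling on a single node could look like, here is a minimal Python sketch. It is not Icinga and not anything we run in production; it simply assumes a Linux node where `/proc/stat` is readable, and the function names (`parse_proc_stat`, `utilization`, `monitor`) are made up for illustration. It samples the per-core counters at a fixed interval and prints the busy percentage of each core, which gives a record over time instead of a live view in top.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} from /proc/stat content."""
    usage = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only per-core lines ("cpu0", "cpu1", ...); skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # Columns: user nice system idle iowait irq softirq steal [guest ...].
        # Guest time is already included in user/nice, so only the first 8 are summed.
        total = sum(values[:8])
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        usage[fields[0]] = (total - idle, total)
    return usage


def utilization(before, after):
    """Percent busy per core between two parsed samples."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        elapsed = total1 - total0
        result[cpu] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result


def read_proc_stat(path="/proc/stat"):
    with open(path) as f:
        return f.read()


def monitor(interval=5.0, samples=12):
    """Print one timestamped line of per-core utilization per interval."""
    prev = parse_proc_stat(read_proc_stat())
    for _ in range(samples):
        time.sleep(interval)
        cur = parse_proc_stat(read_proc_stat())
        pct = utilization(prev, cur)
        cores = sorted(pct, key=lambda name: int(name[3:]))  # cpu2 before cpu10
        print(time.strftime("%H:%M:%S") + " " +
              " ".join("%s=%.0f%%" % (c, pct[c]) for c in cores))
        prev = cur
```

Calling `monitor(interval=5.0, samples=12)` on a compute node and redirecting the output to a file would give roughly a minute of history. The interval and sample count are arbitrary choices here, and the same caveats about polling overhead and gaps under contention apply.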

Hope that is close to what you are looking for...