TACC / tacc_stats

TACC Stats is an automated resource-usage monitoring and analysis package.
GNU Lesser General Public License v2.1

Support multiple jobs on the same node #41

Open stephenlienharrell opened 1 year ago

stephenlienharrell commented 1 year ago

Currently we collect everything at the node level. We need to determine which metrics can be split out on a per-core or per-socket basis, which cannot, and whether splitting them out is useful.

stephenlienharrell commented 1 year ago

For CPU: need core affinity matched to job ID.
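One way the core-to-job matching might look, as a rough sketch rather than anything tacc_stats does today: read `Cpus_allowed_list` from `/proc/<pid>/status` for each process of a job and record which job IDs may run on each core. The `job_pids` mapping (job ID to PIDs) is a hypothetical input; how it gets built is still an open question in this issue.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8' into a set of core ids."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return cores


def allowed_cores(pid):
    """Return the cores a process may run on, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return set()


def cores_by_job(job_pids):
    """Map core id -> set of job ids allowed on it.

    job_pids is a hypothetical mapping {job_id: [pid, ...]}.
    """
    owners = {}
    for job_id, pids in job_pids.items():
        for pid in pids:
            try:
                cores = allowed_cores(pid)
            except OSError:
                continue  # process exited between listing and reading
            for core in cores:
                owners.setdefault(core, set()).add(job_id)
    return owners
```

A core that ends up with more than one job ID is shared (or the jobs are not pinned), so its per-core counters cannot be attributed to a single job.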

For memory: need to find all memory usage under the primary job starter programmatically. Find the job starter, then get the memory of all its child processes: ps -o pid,ppid,pgid,comm,%cpu,%me
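A minimal sketch of the process-tree idea, assuming the job starter's PID is already known (`root_pid` below is a hypothetical input). It uses `ps -e -o pid=,ppid=,rss=` rather than the column set above so the output is headerless and purely numeric, then sums resident set size over the starter and all of its descendants.

```python
import subprocess
from collections import defaultdict


def parse_ps(text):
    """Parse 'pid ppid rss' lines into {pid: (ppid, rss_kib)}."""
    procs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, ppid, rss = map(int, fields)
        procs[pid] = (ppid, rss)
    return procs


def tree_rss_kib(procs, root_pid):
    """Sum RSS (KiB) of root_pid and every descendant found in procs."""
    children = defaultdict(list)
    for pid, (ppid, _) in procs.items():
        children[ppid].append(pid)
    total, stack, seen = 0, [root_pid], set()
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in procs:
            continue
        seen.add(pid)
        total += procs[pid][1]
        stack.extend(children[pid])
    return total


def snapshot():
    """Take one process-table snapshot via ps (procps-style options)."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=,ppid=,rss="],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)
```

Usage would be something like `tree_rss_kib(snapshot(), root_pid)`. Note that summing RSS counts shared pages once per process, so the total can overstate real usage.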

Snapshot this at the same time as the rest of the metrics. Find out whether there is a way to get the job ID, then match the job ID to specific processes on the node to get a snapshot of memory usage.

Can we do this programmatically for any other statistics?

stephenlienharrell commented 1 year ago

Regarding the approach above, we need to make sure we can capture detached processes.
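One possible way around this, sketched under a big assumption: if the scheduler places each job in its own cgroup, cgroup membership survives a process being reparented, so a detached process can still be tied to its job without walking parent/child links. The regex below assumes a Slurm-style path containing `job_<id>` (for example `/slurm/uid_1000/job_12345/step_0`); that layout is an assumption about site configuration, not something confirmed in this thread.

```python
import re

# Assumed Slurm-style cgroup path component, e.g. ".../job_12345/step_0".
_JOB_RE = re.compile(r"/job_(\d+)(?:/|$)")


def job_id_from_cgroup(text):
    """Return the job id found in /proc/<pid>/cgroup contents, or None."""
    for line in text.splitlines():
        match = _JOB_RE.search(line)
        if match:
            return match.group(1)
    return None


def job_id_of(pid):
    """Look up the job id for a live process (Linux only)."""
    try:
        with open(f"/proc/{pid}/cgroup") as fh:
            return job_id_from_cgroup(fh.read())
    except OSError:
        return None  # process exited or /proc entry not readable
```

If this holds on our systems, scanning `/proc` and grouping PIDs by `job_id_of(pid)` would pick up daemonized children as well. On nodes without per-job cgroups it returns `None` for everything, so a fallback would still be needed.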