TACC / tacc_stats

TACC Stats is an automated resource-usage monitoring and analysis package.
GNU Lesser General Public License v2.1

How many GPUs used counter #45

Open stephenlienharrell opened 1 year ago

stephenlienharrell commented 1 year ago

We need a counter on the new version that says how many GPUs were used for a job.

stephenlienharrell commented 8 months ago

Need to separate the per-GPU counter data in order to implement this correctly.

nicejunjie commented 8 months ago

Preliminary implementation is done and online for LS6. Limitations: 1) raw data for individual GPUs is merged in the database on import, so only the total percentage is available. 2) a few nodes in gpu-a100-small and gpu-dev do not seem to have GPU recording enabled by the monitor, so no GPU data is recorded, e.g.: https://ls6-stats.tacc.utexas.edu/machine/job/1473810/

Possible workaround without changing the database structure: set "event" to "utilization_$gpunumber" instead of "utilization" when importing, then extract "$gpunumber" in views.py.
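
A rough sketch of how that workaround could look, in plain Python. Everything here is an assumption for illustration: the helper names (`tag_event`, `gpu_number_from_event`, `count_gpus_used`), the `(host, event, value)` row shape, and the "peak utilization above a threshold" rule for deciding that a GPU was used are all hypothetical and are not taken from the tacc_stats code or schema.

```python
import re
from collections import defaultdict

# Hypothetical naming scheme: "utilization_<gpu index>", e.g. "utilization_2".
_GPU_EVENT = re.compile(r"^utilization_(\d+)$")


def tag_event(event, gpu_number):
    """Import side: fold the GPU index into the event name."""
    return f"{event}_{gpu_number}"


def gpu_number_from_event(event):
    """View side: recover the GPU index, or None if the event is not per-GPU."""
    match = _GPU_EVENT.match(event)
    return int(match.group(1)) if match else None


def count_gpus_used(rows, threshold=0.0):
    """Count distinct (host, gpu) pairs whose peak utilization exceeds threshold.

    rows is an iterable of (host, event, value) tuples; this shape is assumed
    for the sketch and is not the real database layout.
    """
    peak = defaultdict(float)
    for host, event, value in rows:
        gpu = gpu_number_from_event(event)
        if gpu is None:
            continue  # skip non-GPU events and untagged "utilization" rows
        key = (host, gpu)
        peak[key] = max(peak[key], value)
    return sum(1 for value in peak.values() if value > threshold)
```

The threshold is left as a parameter because an idle GPU may still report a small nonzero utilization; what cutoff counts as "used" would need to be decided against real data.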