Goal/scope: gathering any monitoring information relevant to a particular glidein slot, once the condor STARTD has started. I am assuming that, if condor is running, we are going to use the ClassAd mechanism to distribute this monitoring information.
Some ideas for things to monitor here:
GPU benchmark: result of a short (1m) GPU benchmark run at glidein startup. It could be useful for enabling normalized accounting, and also for letting users filter out very slow GPUs.
GPU utilization: might be tricky, but it would be nice to have a measurement that reflects the average GPU utilization over the job duration. Maybe we will need to poll nvidia-smi via STARTD_CRON and compute an average; not sure yet.
... others?
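
For the GPU utilization idea, a rough sketch of what a STARTD_CRON script could look like, in Python. This is only an illustration under assumptions, not a decided design: the attribute names (GLIDEIN_GPU_Util_Avg, GLIDEIN_GPU_Util_Samples) and the state file path are made up here, and the average is taken over all samples since the state file was created, so resetting it at job start is still an open question. The only external pieces relied on are the nvidia-smi query for utilization.gpu in CSV form and the fact that a startd cron job publishes "Name = value" lines printed on stdout.

```python
import json
import subprocess
from pathlib import Path

# Hypothetical state file; a real glidein would pick a per-slot location.
STATE_FILE = Path("gpu_util_state.json")


def parse_utilization(csv_text):
    """Parse the CSV output of the nvidia-smi utilization query.

    Expects one percentage per line (one line per GPU); lines that are
    empty or not numeric (e.g. "[N/A]") are skipped.
    """
    values = []
    for line in csv_text.splitlines():
        try:
            values.append(float(line.strip()))
        except ValueError:
            continue
    return values


def update_running_average(state, sample):
    """Fold one sample into a cumulative mean and return the new state."""
    count = state.get("count", 0) + 1
    old_mean = state.get("mean", 0.0)
    return {"count": count, "mean": old_mean + (sample - old_mean) / count}


def format_classad(state):
    """Render the state as 'Name = value' lines (attribute names are assumptions)."""
    return (
        f"GLIDEIN_GPU_Util_Avg = {state['mean']:.2f}\n"
        f"GLIDEIN_GPU_Util_Samples = {state['count']}"
    )


def main():
    """One polling step: sample nvidia-smi, update the stored mean, print attributes."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    per_gpu = parse_utilization(out)
    if not per_gpu:
        return  # nothing measurable this round; publish nothing
    sample = sum(per_gpu) / len(per_gpu)  # average across the GPUs on the node
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state = update_running_average(state, sample)
    STATE_FILE.write_text(json.dumps(state))
    print(format_classad(state))
```

The script run by STARTD_CRON would simply call main() once per period. Keeping only a count and a mean in the state file avoids storing every sample; whether the average should be per GPU rather than across all GPUs on the node is left open.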