Another aspect to consider is running periodic tests on the GPUs, such as hgemm and igemm and any peer-to-peer tests, to detect GPUs whose performance degrades over time. In addition to running these at the beginning of a job, we could run them at fixed intervals, for example after every checkpoint.
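As a rough sketch of the periodic check described above, the snippet below flags GPUs whose measured throughput falls well below the median of their peers and only runs on checkpoint steps. Everything named here is hypothetical: `measure` stands in for whatever actually runs the hgemm/igemm benchmark and returns TFLOP/s per GPU, and the 0.9 tolerance is an arbitrary placeholder, not a recommended value.

```python
from statistics import median
from typing import Callable, Dict, List


def find_slow_gpus(tflops: Dict[int, float], tolerance: float = 0.9) -> List[int]:
    """Return ids of GPUs whose throughput is below tolerance * median."""
    if not tflops:
        return []
    cutoff = tolerance * median(tflops.values())
    return sorted(gpu for gpu, value in tflops.items() if value < cutoff)


def check_after_checkpoint(
    step: int,
    checkpoint_every: int,
    measure: Callable[[], Dict[int, float]],
    tolerance: float = 0.9,
) -> List[int]:
    """Run the benchmark only on checkpoint steps (including step 0)."""
    if step % checkpoint_every != 0:
        return []
    # `measure` is a hypothetical callable wrapping the real GEMM benchmark.
    return find_slow_gpus(measure(), tolerance)
```

Called with `step=0` this also covers the start-of-job run, so the same function could serve both cases. Comparing against the median only catches GPUs that lag their peers on the same node; tracking each GPU against its own earlier results would need stored history.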
Per #37:
Separated the issues to better track progress.
We already have GEMM benchmarks as part of the node performance overview. I can adapt these tests to run as a pre-execution hook during our job submissions, and to target other systems.
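A minimal sketch of what such a pre-execution hook could look like, assuming the hook signals failure through its return code. The names `pre_execution_hook`, `measure`, and `min_tflops` are placeholders of mine, not part of the existing benchmarks, and the threshold would have to come from the node performance overview.

```python
import sys
from typing import Callable, Dict


def pre_execution_hook(
    measure: Callable[[], Dict[int, float]],
    min_tflops: float,
) -> int:
    """Return 0 if every GPU meets min_tflops, otherwise 1."""
    # `measure` is a hypothetical wrapper around the existing GEMM benchmark.
    results = measure()
    failed = {gpu: value for gpu, value in results.items() if value < min_tflops}
    for gpu, value in sorted(failed.items()):
        print(
            f"GPU {gpu}: {value:.1f} TFLOP/s is below the minimum of {min_tflops:.1f}",
            file=sys.stderr,
        )
    return 1 if failed else 0
```

The submission script would pass the result to `sys.exit(...)`. Whether a nonzero exit actually blocks the job depends on how the scheduler handles hook failures, so that part is an assumption to verify per system.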