Develop pre-/mid-execution test harness

Per #37:

Another aspect to consider is running periodic tests on the GPUs, such as hgemm and igemm benchmarks and any peer-to-peer tests, to detect GPUs whose performance degrades over time. In addition to running these at the beginning of a job, we can rerun them at a fixed interval, for example after every checkpoint.
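One way to turn per-GPU benchmark numbers into a "low-performing GPU" signal is to compare each device against the median of its peers. The sketch below is illustrative only: `flag_slow_gpus`, the `tolerance` threshold, and the sample TFLOP/s figures are hypothetical and do not come from the existing benchmarks.

```python
from statistics import median


def flag_slow_gpus(tflops_by_gpu, tolerance=0.9):
    """Return the sorted IDs of GPUs running below tolerance * median throughput.

    tflops_by_gpu: dict mapping a GPU ID to its measured GEMM throughput.
    tolerance: hypothetical cutoff; 0.9 flags GPUs more than 10% below the median.
    """
    if not tflops_by_gpu:
        return []
    baseline = median(tflops_by_gpu.values())
    return sorted(
        gpu for gpu, tflops in tflops_by_gpu.items() if tflops < tolerance * baseline
    )


# Made-up measurements for four GPUs on one node (not real results).
measurements = {0: 101.2, 1: 99.8, 2: 72.4, 3: 100.5}
print(flag_slow_gpus(measurements))  # → [2]
```

Comparing against the median rather than a fixed number avoids hard-coding an expected throughput per system, although a fixed per-system baseline would also catch a node where every GPU is slow.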

Separated the issues to better track progress.

We have GEMM benchmarks as part of the node performance overview. I can adapt these tests to run as a pre-execution hook during job submission, and to target other systems.
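As a rough sketch of how such a harness might be wired up, the example below runs a set of checks before the first step and again at every checkpoint step. All names here (`time_check`, `is_check_step`, `run_harness`, `dummy_gemm`) are hypothetical, and the pure-Python matrix multiply is only a stand-in for the real GEMM benchmarks, which would run on the GPUs.

```python
import time


def time_check(fn, repeats=3):
    """Return the best wall-clock time (seconds) of fn over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def is_check_step(step, checkpoint_interval):
    """True at step 0 (pre-execution) and at every checkpoint step after it."""
    return step == 0 or (checkpoint_interval > 0 and step % checkpoint_interval == 0)


def run_harness(checks, total_steps, checkpoint_interval):
    """Run every check at each check step; return (step, name, seconds) tuples."""
    results = []
    for step in range(total_steps + 1):
        # A real job would perform its training step and checkpointing here.
        if is_check_step(step, checkpoint_interval):
            for name, fn in checks.items():
                results.append((step, name, time_check(fn)))
    return results


def dummy_gemm(n=16):
    """Pure-Python stand-in for a GEMM benchmark (assumption, not the real test)."""
    a = [[1.0] * n for _ in range(n)]
    return [[sum(a[i][k] * a[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


results = run_harness({"dummy_gemm": dummy_gemm}, total_steps=4, checkpoint_interval=2)
for step, name, seconds in results:
    print(f"step={step} check={name} best_time={seconds:.6f}s")
```

Keeping the checks as plain callables means the same harness could be reused for peer-to-peer tests or on other systems by swapping in different functions.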

argonne-lcf / Megatron-DeepSpeed
