argonne-lcf / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Develop pre-/mid-execution test harness #40

Open nscottnichols opened 3 months ago

nscottnichols commented 3 months ago

Per #37:

Another aspect to consider is to run periodic tests on the GPUs, such as hgemm and igemm and any peer-to-peer tests, to see if there are any low-performing GPUs over time. So, in addition to running at the beginning of a job, we can run these after a specific duration, such as after every checkpoint.
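
As a rough illustration of what such a periodic check could look like, here is a minimal sketch of a per-GPU hgemm throughput probe in PyTorch. The matrix size, iteration count, and function name are placeholder assumptions, not part of the existing node performance benchmarks.

```python
# Hypothetical per-GPU hgemm throughput probe (sketch only).
# Matrix size and iteration count are placeholder assumptions.
import time
import torch

def hgemm_tflops(device: torch.device, n: int = 8192, iters: int = 20) -> float:
    """Time an n x n half-precision matmul and return achieved TFLOP/s."""
    a = torch.randn(n, n, dtype=torch.float16, device=device)
    b = torch.randn(n, n, dtype=torch.float16, device=device)
    # Warm up so library heuristics and clocks settle before timing.
    for _ in range(3):
        torch.matmul(a, b)
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    flops = 2.0 * n**3 * iters  # 2*n^3 FLOPs per GEMM
    return flops / elapsed / 1e12

if __name__ == "__main__":
    # Report throughput for every visible GPU; slow outliers stand out here.
    for idx in range(torch.cuda.device_count()):
        dev = torch.device(f"cuda:{idx}")
        print(f"GPU {idx}: {hgemm_tflops(dev):.1f} TFLOP/s (hgemm)")
```

An igemm or peer-to-peer bandwidth probe could follow the same pattern, with per-system thresholds deciding whether a GPU counts as low-performing.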

I separated the issues to better track progress.

We have GEMM benchmarks as part of the node performance overview. I can adapt these tests to run as a pre-execution hook during our job submissions and extend them to target other systems.
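
For the pre-execution / post-checkpoint wiring, one option might look like the sketch below: each rank runs the probe, results are gathered, and rank 0 flags slow outliers. The function names, module name, and threshold are assumptions for illustration, not the existing harness.

```python
# Hypothetical wiring of the probe as a pre-execution / post-checkpoint hook.
# Names, threshold, and the gpu_probe module are assumed for this sketch.
import logging
import torch
import torch.distributed as dist

from gpu_probe import hgemm_tflops  # assumed module holding the probe sketched above

MIN_HGEMM_TFLOPS = 100.0  # placeholder floor; would be tuned per system

def gpu_health_check(tag: str) -> None:
    """Run the hgemm probe on this rank's GPU and flag slow outliers."""
    device = torch.device("cuda", torch.cuda.current_device())
    result = torch.tensor([hgemm_tflops(device)], device=device)
    if dist.is_initialized():
        gathered = [torch.zeros_like(result) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, result)
        if dist.get_rank() == 0:
            for rank, t in enumerate(gathered):
                if t.item() < MIN_HGEMM_TFLOPS:
                    logging.warning("[%s] rank %d GPU below floor: %.1f TFLOP/s",
                                    tag, rank, t.item())
    else:
        logging.info("[%s] %.1f TFLOP/s", tag, result.item())

# Intended call sites (assumed, not existing code):
#   gpu_health_check("pre-execution")    # before training starts
#   gpu_health_check("post-checkpoint")  # after each checkpoint save
```

The same hook could also be invoked from the job submission scripts before the training launch, which would cover the "run at the beginning of a job" case without touching the training loop.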