centerforaisafety / cerberus-cluster

HPC cluster code and configurations for running on OCI
Universal Permissive License v1.0

Consider adding ML tests #62

Open steven-safeai opened 1 year ago

steven-safeai commented 1 year ago

Such as: https://github.com/mlcommons/hpc

rumiah-safe commented 1 year ago

While the HPC benchmark seems pretty good, it is much more focused on scientific ML computing such as protein folding. The performance is translatable to some extent, since some of its benchmarks use the same underlying math, but none of the clusters listed in the HPC benchmark have a hardware setup similar to ours. The general MLPerf benchmark, however (specifically version 2.1), has many clusters with hardware setups very similar to our own. Azure in particular open sourced their optimized config for our hardware. The performance benchmark is fairly easy to run using Docker, and they were also using Slurm as their resource manager, though we will undoubtedly run into some issues when we try to run it ourselves. Some of the 7 benchmarks listed are likely more relevant than others (see the first link below), and we should select from among them.
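As a rough illustration of what launching one of these benchmarks through Slurm could look like, here is a minimal Python sketch that only builds an `sbatch` command line and prints it without submitting anything. The `BenchmarkJob` class, the `build_sbatch_command` helper, the script name `run_bert.sh`, and the node/GPU counts are all hypothetical placeholders, not taken from the MLPerf or Azure repositories; the actual launch scripts and required environment variables would need to be checked against the chosen implementation.

```python
import shlex
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BenchmarkJob:
    """Hypothetical description of one benchmark run on a Slurm cluster."""
    name: str            # short benchmark label, e.g. "bert"
    nodes: int           # number of nodes to request
    gpus_per_node: int   # GPUs to request on each node
    time_limit: str      # Slurm time format, e.g. "02:00:00"
    script: str          # path to the batch script (placeholder name)
    env: Dict[str, str] = field(default_factory=dict)  # extra env vars


def build_sbatch_command(job: BenchmarkJob) -> List[str]:
    """Build (but do not run) an sbatch command for the given job."""
    cmd = [
        "sbatch",
        f"--job-name=mlperf-{job.name}",
        f"--nodes={job.nodes}",
        f"--gpus-per-node={job.gpus_per_node}",
        f"--time={job.time_limit}",
    ]
    if job.env:
        # Keep the caller's environment and add the extra variables.
        exports = ",".join(f"{k}={v}" for k, v in sorted(job.env.items()))
        cmd.append(f"--export=ALL,{exports}")
    cmd.append(job.script)
    return cmd


if __name__ == "__main__":
    # Placeholder values for illustration only.
    job = BenchmarkJob(
        name="bert",
        nodes=2,
        gpus_per_node=8,
        time_limit="02:00:00",
        script="run_bert.sh",
        env={"SEED": "1"},
    )
    # Print the command so it can be reviewed before anyone submits it.
    print(shlex.join(build_sbatch_command(job)))
```

Building the command as a list and printing it keeps the sketch side-effect free, so it can be reviewed or adapted before anything is actually queued on the cluster.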

The general benchmark also has more systems to compare against. When we switch to H100, we should run their next edition (expected in about 6 months), since being listed there would give us a good amount of credibility.

Putting this aside for now to focus on projects for the roadmap.

Here are all the useful links I have found so far:
https://arxiv.org/abs/1910.01500
https://github.com/mlcommons/training
https://mlcommons.org/en/training-normal-30/
https://www.hpcwire.com/2022/11/14/mlcommons-issues-mlperf-hpc-training-results-for-larger-systems/

Azure BERT:
https://github.com/mlcommons/training_results_v2.1/tree/main/Azure-HazyResearch/benchmarks/bert/implementations/ND96amsr_A100_v4
https://www.youtube.com/watch?v=txtvMhzEDu8