This PR is based on the MIG changes because the MIG PR slightly changes the way the GPUs are configured.
I added one parameter that sets the number of shards for the whole node. The shards are split evenly between the GPUs on the node. Initially, I wanted to set the shard number per GPU, but that was complicated to configure, even more so with the MIG setup.
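To illustrate the even split, here is a minimal Python sketch. It is not the code from this PR: the function name `split_shards` is hypothetical, and giving any remainder to the first GPUs is an assumption made only for the example.

```python
def split_shards(total_shards: int, gpu_count: int) -> list[int]:
    """Return the number of shards assigned to each GPU on a node.

    Hypothetical helper for illustration only. How a remainder is
    distributed is an assumption, not behavior confirmed by this PR.
    """
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    if total_shards < 0:
        raise ValueError("total_shards must be non-negative")
    # Every GPU gets the same base share; `extra` shards are left over.
    base, extra = divmod(total_shards, gpu_count)
    # Assumption: the first `extra` GPUs each take one leftover shard.
    return [base + 1 if i < extra else base for i in range(gpu_count)]


if __name__ == "__main__":
    # A node-wide shard count of 8 on a 4-GPU node.
    print(split_shards(8, 4))
```

With a shard count that is a multiple of the GPU count, every GPU ends up with the same number of shards, which is the case the single node-wide parameter is meant to cover.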
This PR adds a new parameter to each infra to set the "shard" number, similarly to the MIG configuration. If we prefer, we could instead set the shard number directly from `profile::slurm::base` with the hieradata.
Related Puppet PR: https://github.com/ComputeCanada/puppet-magic_castle/pull/322