ComputeCanada / magic_castle

Terraform modules to replicate the HPC user experience in the cloud
MIT License

Sharding GPU support #289

Open etiennedub opened 8 months ago

etiennedub commented 8 months ago

This PR is based on the MIG changes, because the MIG PR slightly changes the way GPUs are configured.

I added one parameter that sets the number of shards for the whole node. The shards are split evenly between the GPUs on the node. Initially, I wanted to set the shard number per GPU, but that was complicated to configure, even more so considering the MIG setup.
This PR adds a new parameter to each infra to set the "shard" number, similar to the MIG configuration. If we prefer, we could instead set the shard number directly from profile::slurm::base with the hieradata.
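
To illustrate the even split, here is a minimal Python sketch of how a node-wide shard count could map to per-GPU counts. It is not code from this PR: the function name `split_shards` is hypothetical, and the handling of a remainder (extra shards going to the first GPUs) is an assumption, since the actual distribution is decided by Slurm and the Puppet configuration.

```python
def split_shards(total_shards: int, gpu_count: int) -> list[int]:
    """Return how many shards each GPU on a node would receive.

    Hypothetical helper for illustration only. Assumption: when the
    total is not divisible by the GPU count, the leftover shards are
    given one each to the first GPUs.
    """
    if gpu_count <= 0:
        raise ValueError("gpu_count must be a positive integer")
    if total_shards < 0:
        raise ValueError("total_shards must not be negative")
    # base: shards every GPU gets; extra: leftover shards to hand out
    base, extra = divmod(total_shards, gpu_count)
    return [base + (1 if index < extra else 0) for index in range(gpu_count)]


if __name__ == "__main__":
    # A node with 4 GPUs and a node-wide shard count of 8
    print(split_shards(8, 4))
```

With a count that divides evenly, such as 8 shards on 4 GPUs, every GPU ends up with the same number of shards, which is the case the single node-wide parameter is meant to cover.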

Related Puppet PR: https://github.com/ComputeCanada/puppet-magic_castle/pull/322

cmd-ntrf commented 6 months ago

@etiennedub Can you rebase this and fix the conflicts now that the MIG PRs have been merged?