2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org

Distributed ML training with a large number (100+) of GPUs #2097

Open jmunroe opened 1 year ago

jmunroe commented 1 year ago

Context

GPUs are a limited resource at many cloud vendors. Further challenges include that availability is fine-grained, differing between specific GPU models, and is not uniformly distributed across regions and zones. When we have requested additional GPUs from cloud vendors, we have typically received only 1-4 GPUs per project.

There have been requests from our community partners (e.g. m2lines and LEAP) to push the boundaries of science with innovations in distributed machine learning. For that work, a single user may want to request up to 100 GPUs at one time. A common argument made in advocating for cloud-based science is that resources are effectively unbounded. However, in practice, our communities are finding that GPUs are in fact a very limited resource. This can be a source of friction for the adoption of platforms like 2i2c for advanced research computing.
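
To illustrate the kind of request involved, here is a minimal sketch of how a user on a hub with a Dask Gateway deployment might ask for a large pool of GPU workers from a notebook session. This assumes Dask Gateway is the scale-out mechanism and that the hub exposes a GPU worker profile; the exact cluster options are deployment-specific and are not specified in this issue.

```python
from dask_gateway import Gateway

gateway = Gateway()                  # connect to the hub's Dask Gateway
options = gateway.cluster_options()  # deployment-specific options (image, cores, memory, GPU profile)

cluster = gateway.new_cluster(options)
cluster.scale(100)                   # ask for ~100 workers; the count actually granted is capped by the cloud GPU quota

client = cluster.get_client()        # distributed computation then runs against this client
```

In practice the scale-out stalls at whatever GPU quota the cloud vendor has granted (often only 1-4 GPUs per project, as noted above), which is exactly the friction described here.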

I am opening this issue as a starting point for improving access to GPUs for communities, making more explicit what limitations currently exist, and identifying what possible development 2i2c could seek funding for to improve the situation.

Proposal

No response

Updates and actions

No response

rabernat commented 1 year ago

A common argument made in advocating for cloud-based science is that resources are effectively unbounded. However, in practice, our communities are finding that GPUs are in fact a very limited resource. This can be a source of friction for the adoption of platforms like 2i2c for advanced research computing.

This is an excellent summary of the challenge here.