EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

Hosted GitHub Runners for CI #531

Open Mistobaan opened 2 years ago

Mistobaan commented 2 years ago

Overview

To effectively test any change to the codebase against the repository's full CUDA / MPI / Apex stack, it would be nice to dedicate some cluster resources to self-hosted runners, similar to how DeepSpeed tests its own codebase.
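For concreteness, a workflow targeting such a runner might look like the sketch below. This is only an illustration: the workflow file name, the `gpu` runner label, the requirements path, and the test command are all assumptions, not the repository's actual CI configuration.

```yaml
# .github/workflows/gpu-tests.yml (hypothetical)
name: GPU tests

on:
  pull_request:
  push:
    branches: [main]

jobs:
  test:
    # Runs on a machine we register ourselves; "gpu" is an assumed
    # label for cluster nodes with CUDA available.
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v2
      - name: Install dependencies
        run: pip install -r requirements/requirements.txt
      - name: Run test suite
        run: pytest tests/
```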

StellaAthena commented 2 years ago

This is something we should be able to set up in the next couple weeks. Are you familiar with setting up such a hosted runner?

Mistobaan commented 2 years ago

I can figure out the details; it really depends on what hardware we have available: cloud, bare metal, or k8s.

StellaAthena commented 2 years ago

> I can figure out the details; it really depends on what hardware we have available: cloud, bare metal, or k8s.

k8s, building from a Dockerfile. There’s info on our Dockerfile here
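For a k8s cluster, one common approach (a suggestion, not necessarily what this repo would use) is actions-runner-controller, which runs self-hosted runners as pods managed by a CRD. A minimal sketch, assuming the controller is already installed and a runner image built from the repo's Dockerfile is published under the hypothetical name `eleutherai/gpt-neox-runner`:

```yaml
# Hypothetical RunnerDeployment for actions-runner-controller
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gpt-neox-gpu-runners
spec:
  replicas: 2
  template:
    spec:
      repository: EleutherAI/gpt-neox
      labels:
        - gpu
      # Assumed custom image built from the repo's Dockerfile, with
      # the CUDA / MPI / Apex stack preinstalled.
      image: eleutherai/gpt-neox-runner:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```

Each replica registers itself against the repository and picks up jobs whose `runs-on` labels match.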

StellaAthena commented 2 years ago

@Mistobaan Based on our recent conversations, I'm currently under the impression that the code works and now we just need to allocate a dedicated GPU cluster and set up the CI. Is that correct? If so, I can set up a dedicated GPU cluster and we can start testing the CI.
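Once a runner is registered, a quick smoke test can confirm that jobs actually land on the cluster and can see the GPUs before the full suite is wired up. A minimal sketch, reusing the assumed `gpu` label from above:

```yaml
# Hypothetical manually-triggered smoke test for the new runner
name: Runner smoke test

on: workflow_dispatch

jobs:
  check-gpu:
    runs-on: [self-hosted, gpu]
    steps:
      - name: Show visible GPUs
        run: nvidia-smi
      - name: Check that PyTorch sees CUDA
        run: python -c "import torch; assert torch.cuda.is_available()"
```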