coreweave / ml-containers

Add BLOOM image #1

Closed harubaru closed 1 year ago

harubaru commented 1 year ago

Adds a Dockerfile that sets up an environment for BLOOM training

Fixes https://github.com/coreweave/infra-pm/issues/332

github-actions[bot] commented 1 year ago

@salanki Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/3511413854 Image: ghcr.io/coreweave/ml-containers/gpt-neox-determined:amercurio-bloom-image-4563711

github-actions[bot] commented 1 year ago

@salanki Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/3511413844 Image: ghcr.io/coreweave/ml-containers/sd-finetuner:amercurio-bloom-image-4563711

github-actions[bot] commented 1 year ago

@salanki Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/3511413845 Image: ghcr.io/coreweave/ml-containers/bloom:amercurio-bloom-image-4563711

salanki commented 1 year ago

Looking at https://github.com/jagwar/customAMI-benchmark/blob/bloom-AMI/nvidia-efa-ml-al2-enroot_pyxis.json#L298 it feels like some things are missing, such as Megatron-DeepSpeed and bigscience. Maybe you didn't check them out since they're not part of the conda env? I think it makes sense to have them in the image regardless, so everything that is needed for training is there.

harubaru commented 1 year ago

The only issue is deciding where the repositories should sit inside the image, which is why I didn't originally include them. Running a git clone inside the image should suffice; maybe the directory structure should be set up in the image as well? The only reason I didn't clone the Megatron repo and install its dependencies is that the conda env already has all of Megatron's dependencies anyway.

BigScience is pulled here: https://github.com/jagwar/customAMI-benchmark/blob/bloom-AMI/create-conda-env.sh#L30

Megatron is also pulled here: https://github.com/jagwar/customAMI-benchmark/blob/bloom-AMI/create-conda-env.sh#L24
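
For illustration only, here is a rough sketch of what cloning both repositories into the image could look like. It is not the Dockerfile from this PR. The base image, the `/opt/bloom` directory, and the upstream repository URLs are all assumptions, since the thread does not settle where the checkouts should live or which remotes to use.

```dockerfile
# Sketch under stated assumptions; not the Dockerfile from this PR.
# Placeholder base image: the real image would start from its own CUDA/conda base.
FROM ubuntu:22.04

# git and CA certificates are needed for the HTTPS clones below.
RUN apt-get update \
    && apt-get install -y --no-install-recommends git ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Hypothetical location for the checkouts; the thread leaves this open.
WORKDIR /opt/bloom

# Assumed upstream URLs for the two repositories named in the discussion.
# Shallow clones keep the image layer small.
RUN git clone --depth 1 https://github.com/bigscience-workshop/Megatron-DeepSpeed.git \
    && git clone --depth 1 https://github.com/bigscience-workshop/bigscience.git
```

Pinning each clone to a specific commit instead of the default branch head would make image builds reproducible, at the cost of updating the pin by hand.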

github-actions[bot] commented 1 year ago

@harubaru Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/3528907726 Image: ghcr.io/coreweave/ml-containers/bloom:amercurio-bloom-image-bdba3a3

wbrown commented 1 year ago

@salanki is this ready for merge?

salanki commented 1 year ago

Almost. I have been using the branch myself. Have some changes I might make so let's leave it open.

wbrown commented 1 year ago

@salanki Still pending? Let's not leave this PR open forever. :)