harubaru closed this 1 year ago
@salanki Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/3511413854
Image: ghcr.io/coreweave/ml-containers/gpt-neox-determined:amercurio-bloom-image-4563711
@salanki Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/3511413844
Image: ghcr.io/coreweave/ml-containers/sd-finetuner:amercurio-bloom-image-4563711
@salanki Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/3511413845
Image: ghcr.io/coreweave/ml-containers/bloom:amercurio-bloom-image-4563711
Looking at https://github.com/jagwar/customAMI-benchmark/blob/bloom-AMI/nvidia-efa-ml-al2-enroot_pyxis.json#L298 it seems some pieces are missing, such as Megatron-DeepSpeed and bigscience. Maybe you didn't check them out since they're not part of the conda env? I think it makes sense to have them in the image regardless, so everything needed for training is there.
The only issue is deciding where the repositories should sit inside the image, which is why I didn't originally include them. Running a git clone inside the image should suffice; maybe the directory structure should be set up in the image as well? The only reason I didn't clone the Megatron repo and install its dependencies is that the conda env already has all of Megatron's dependencies anyway.
BigScience is pulled here: https://github.com/jagwar/customAMI-benchmark/blob/bloom-AMI/create-conda-env.sh#L30
Megatron is pulled here as well: https://github.com/jagwar/customAMI-benchmark/blob/bloom-AMI/create-conda-env.sh#L24
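For reference, the "git clone inside the image" idea discussed above could look roughly like the fragment below. This is only a sketch, not the Dockerfile from this PR: the base image, the `/workspace` directory, and the upstream repository URLs are assumptions for illustration, and `create-conda-env.sh` may pin a different fork, branch, or commit.

```dockerfile
# Placeholder base image (assumption); the PR's actual base image is not shown in this thread.
FROM ubuntu:22.04

# git and CA certificates are needed to clone over HTTPS.
RUN apt-get update \
    && apt-get install -y --no-install-recommends git ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Hypothetical location for the training repositories inside the image.
WORKDIR /workspace

# Assumed upstream URLs; adjust to whatever create-conda-env.sh actually checks out.
RUN git clone --depth 1 https://github.com/bigscience-workshop/Megatron-DeepSpeed.git \
    && git clone --depth 1 https://github.com/bigscience-workshop/bigscience.git
```

Fixing the directory layout in the image this way means training scripts can rely on stable paths, at the cost of rebuilding the image whenever the repositories need updating.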
@harubaru Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/3528907726
Image: ghcr.io/coreweave/ml-containers/bloom:amercurio-bloom-image-bdba3a3
@salanki is this ready for merge?
Almost. I have been using the branch myself. Have some changes I might make so let's leave it open.
@salanki Still pending? Let's not leave this PR open forever. :)
Adds a Dockerfile that sets up an environment for BLOOM training
Fixes https://github.com/coreweave/infra-pm/issues/332