aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
205 stars 86 forks source link

Include full Dockerfile for FSDP example #503

Closed sean-smith closed 6 days ago

sean-smith commented 1 week ago

This fixes #491

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mnuyens commented 1 week ago

@sean-smith This change will make it so that the example can only be executed on hosts with cuda installed because GDRCopy fails if you don't. This will render it unuseable on many hosts that currently run the example.

mnuyens commented 6 days ago

@sean-smith if you update the Dockerfile to pin a few version the example works again

RUN pip install torchvision torchaudio transformers==4.46.1 datasets fsspec==2023.9.2 python-etcd numpy==1.* RUN pip install torch==2.5.1+cu121 --index-url https://download.pytorch.org/whl/cu121