aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

Silent Failure if custom image puts something into /opt/ml/code #222

Open njbrake opened 2 months ago

njbrake commented 2 months ago

Hi, I was making a new Docker image for training:

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04
COPY src/requirements.txt /opt/ml/code/requirements.txt
RUN pip install --no-cache-dir -r /opt/ml/code/requirements.txt

When I do that, the training container can no longer find the files that normally get copied in when it starts. I traced this to the line linked below, which checks whether the /opt/ml/code folder exists; if it exists at all, the step that downloads sourcedir.tar.gz from its URI is skipped, with no message.

Should the logic be changed so that it doesn't skip the download, or should it at least log a warning when it does?

https://github.com/aws/sagemaker-training-toolkit/blob/628166c157751ae2a46fddc11a7a8cac765fb22c/src/sagemaker_training/files.py#L134
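
A minimal sketch of the "at least warn" option, for illustration only. This is not the toolkit's actual code: `stage_user_code`, `archive_path`, and `code_dir` are hypothetical names, and the archive is assumed to already be on local disk rather than fetched from a URI.

```python
import logging
import os
import tarfile

logger = logging.getLogger(__name__)


def stage_user_code(archive_path, code_dir):
    """Hypothetical stand-in for the step that unpacks sourcedir.tar.gz.

    Returns True if the archive was extracted, False if it was skipped.
    """
    if os.path.exists(code_dir):
        # The behaviour described above skips silently at this point.
        # Logging a warning makes the skip visible in the training logs.
        logger.warning(
            "%s already exists; skipping extraction of %s",
            code_dir,
            archive_path,
        )
        return False

    os.makedirs(code_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(code_dir)
    return True
```

With a check like this, an image that pre-populates /opt/ml/code would still skip the user code, but the log would say so instead of failing later with missing files.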