aws / deep-learning-containers

AWS Deep Learning Containers (DLCs) are a set of Docker images for training and serving models in TensorFlow, TensorFlow 2, PyTorch, and MXNet.
https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html

[feature-request] Support S3 IO datapipes in PyTorch Training Base Image #3462

Closed: awerchniak closed this issue 6 months ago

awerchniak commented 11 months ago

Checklist

Concise Description: Launching a sagemaker.pytorch.estimator.PyTorch.fit job that uses S3 IO datapipes on the container listed below fails immediately with:

UnexpectedStatusException: Error for Training job <job-name>: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise ModuleNotFoundError("TorchData must be built with BUILD_S3=1 to use this datapipe.")

The specific datapipe we want to use is torchdata.datapipes.iter.S3FileLoader. The error occurs because the torchdata distribution included in the image was not compiled with BUILD_S3=1. See full instructions here.
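A quick way to confirm this inside a given container is to construct the datapipe and catch the error quoted above. The following is a minimal sketch, not an official check: the helper name s3_datapipes_available is hypothetical, and it assumes that the torchdata version in the image raises the ModuleNotFoundError from the S3FileLoader constructor, as the traceback suggests.

```python
def s3_datapipes_available() -> bool:
    """Return True if torchdata's S3 datapipes can be constructed here.

    Hypothetical helper: it assumes the "must be built with BUILD_S3=1"
    error is raised when the datapipe is constructed.
    """
    try:
        # The import fails if torchdata (or its datapipes module) is missing.
        from torchdata.datapipes.iter import IterableWrapper, S3FileLoader

        # An empty source is enough; no S3 object is requested at this point.
        S3FileLoader(IterableWrapper([]))
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so this branch
        # covers both a missing package and a build without S3 support.
        return False
    return True


if __name__ == "__main__":
    print("S3 datapipes available:", s3_datapipes_available())
```

Running this at the top of the training script would surface the problem before any data loading starts, instead of partway through the job.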

DLC image/dockerfile:

Is your feature request related to a problem? Please describe: Cannot use S3FileLoader with the PyTorch SageMaker base image.

Describe the solution you'd like: Could you please set BUILD_S3=1 when compiling torchdata, so that users can use this feature? Given that it's an AWS product offering, it's a good idea to encourage use of it for the SageMaker use case.

Describe alternatives you've considered: Users can manually uninstall and re-install torchdata, but this requires the user to understand how to optimize the install for the specific platform.

Additional context: N/A

tejaschumbalkar commented 6 months ago

@awerchniak There is now a new connector for S3: the Amazon S3 Connector for PyTorch. Can you check it out and confirm whether it satisfies your use case?

rohit901 commented 6 months ago

I have the same requirement as the OP. I'm using code that relies on a library, which in turn uses S3FileLoader [from torchdata.datapipes.iter import S3FileLoader].

I'm hitting the same issue. I checked out the Amazon S3 Connector, but its API does not appear to match that of S3FileLoader, so adopting it would mean spending more time understanding the library code and changing every call site.

Can you please make the API compatible so that we can replace calls to S3FileLoader, or build torchdata with S3 support in the base containers?

I'm using the following container: 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.1-gpu-py310
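One possible way to narrow the API gap described in this comment is a thin adapter that reshapes the connector's output into (url, stream) pairs, which is roughly the shape S3FileLoader yields. This is only a sketch under stated assumptions: as_url_stream_pairs and load_prefix are hypothetical names, the items are assumed to expose bucket, key, and read() as the connector's reader objects do, and each object is read fully into memory, which may not suit large files.

```python
import io
from typing import Any, Iterable, Iterator, Tuple


def as_url_stream_pairs(objects: Iterable[Any]) -> Iterator[Tuple[str, io.BytesIO]]:
    """Yield (s3_url, stream) pairs from connector-style reader objects.

    Hypothetical adapter: each item is assumed to expose `bucket`, `key`,
    and `read()`.
    """
    for obj in objects:
        url = f"s3://{obj.bucket}/{obj.key}"
        # Reads the whole object into memory before wrapping it in a stream.
        yield url, io.BytesIO(obj.read())


def load_prefix(prefix_uri: str, region: str) -> Iterator[Tuple[str, io.BytesIO]]:
    """List and load every object under an S3 prefix (hypothetical helper)."""
    # Imported lazily so the adapter above works without the connector installed.
    from s3torchconnector import S3IterableDataset

    dataset = S3IterableDataset.from_prefix(prefix_uri, region=region)
    return as_url_stream_pairs(dataset)
```

Code that iterates over S3FileLoader output as `for url, stream in ...` could then iterate over load_prefix(...) instead, although any library code that chains further datapipe operations would still need changes.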

sirutBuasai commented 6 months ago

@rohit901 PyTorch 2.0.1 is now out of support. We recommend upgrading to a later version of the PyTorch containers. See available_images.md for more information.

As of November 2023, torchdata development has been paused, and the latest PyTorch version it is compatible with is 2.1.1. We strongly recommend using the Amazon S3 Connector for PyTorch instead.