aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0
10.02k stars 6.75k forks

[Content Improvement] Update instance types in pytorch_smdataparallel_mnist_demo #3494

Open enric1994 opened 2 years ago

enric1994 commented 2 years ago

Link to the notebook pytorch_smdataparallel_mnist_demo

What aspects of the notebook can be improved? This notebook no longer works with ml.p3dn.24xlarge instances. It fails with: botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Unsupported instance type ml.p3dn.24xlarge

What are your suggestions? Stop recommending ml.p3dn.24xlarge as an instance type.

jkroll-aws commented 2 years ago

I was able to successfully run this notebook using both the suggested ml.p3dn.24xlarge and ml.p4d.24xlarge instances in us-west-2. Which region are you using? Did you make any other code changes?

enric1994 commented 2 years ago

I am in the eu-central-1 (Frankfurt) region. I haven't changed the code.
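
Since the behaviour appears to differ between us-west-2 and eu-central-1, one rough way to compare regions is to ask EC2 where the underlying instance type is offered. The sketch below is only an approximation: it assumes that stripping the `ml.` prefix gives the matching EC2 instance type, and that EC2 offerings are a usable hint for SageMaker training support, which is not guaranteed. The helper names `ec2_name` and `zones_offering` are hypothetical; the only AWS call used is the EC2 `describe_instance_type_offerings` operation via a boto3 client.

```python
def ec2_name(sagemaker_instance_type):
    """Map a SageMaker type such as 'ml.p3dn.24xlarge' to 'p3dn.24xlarge'.

    Assumption: the SageMaker name is the EC2 name with an 'ml.' prefix.
    """
    prefix = "ml."
    if sagemaker_instance_type.startswith(prefix):
        return sagemaker_instance_type[len(prefix):]
    return sagemaker_instance_type


def zones_offering(ec2_client, sagemaker_instance_type):
    """Return the sorted availability zones, in the client's region, where
    EC2 offers the underlying instance type. An empty list means none."""
    request = {
        "LocationType": "availability-zone",
        "Filters": [
            {"Name": "instance-type", "Values": [ec2_name(sagemaker_instance_type)]}
        ],
    }
    zones = set()
    while True:
        response = ec2_client.describe_instance_type_offerings(**request)
        for offering in response.get("InstanceTypeOfferings", []):
            zones.add(offering["Location"])
        token = response.get("NextToken")
        if not token:
            break
        # Follow pagination until EC2 stops returning a token.
        request["NextToken"] = token
    return sorted(zones)


# Usage (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# ec2 = boto3.client("ec2", region_name="eu-central-1")
# print(zones_offering(ec2, "ml.p3dn.24xlarge"))
```

An empty result for a region would be consistent with the "Unsupported instance type" error above, although a non-empty result does not prove that SageMaker training supports the type there.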

saskra commented 1 year ago

I have the same problem in the same region as you with the following notebook: https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/efficientnet/pytorch_smdataparallel_efficientnet_demo.ipynb

saskra commented 1 year ago

If I use the other recommended instance type (ml.p4d.24xlarge), the result is no better: "UnexpectedStatusException: Error for Training job pt-smddp-efficientnet-b0-2p4d: Failed. Reason: ClientError: Requested instances are not available in these availability zones: [eu-central-1a]. Please try again with subnets having sufficient address space from a different AZ."

Interestingly, in the meantime, a badge saying "skipped" has been added to the notebook for all instances: https://github.com/aws/amazon-sagemaker-examples/blob/22b8203af35d91a1cbeb9a4d3c9c781ac74b24d6/training/distributed_training/pytorch/data_parallel/efficientnet/pytorch_smdataparallel_efficientnet_demo.ipynb?short_path=fce2eeb#L541

saskra commented 1 year ago

Apparently, the data has to be stored in a file system on a different subnet, but you have to guess which of the three subnets to choose.
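
Instead of guessing, the candidate subnets can be narrowed down by availability zone. The following is a minimal sketch, not the notebook's own method: it assumes you already have a list of zones where the instance type is offered, and it uses the EC2 `describe_subnets` operation through a boto3 client to list the subnets of one VPC that sit in those zones. The function name `subnets_in_zones` and the VPC ID in the usage comment are hypothetical.

```python
def subnets_in_zones(ec2_client, vpc_id, zones):
    """List (subnet_id, zone, free_ip_count) for subnets of `vpc_id` that are
    located in one of `zones`, sorted with the most free addresses first."""
    allowed = set(zones)
    request = {"Filters": [{"Name": "vpc-id", "Values": [vpc_id]}]}
    matches = []
    while True:
        response = ec2_client.describe_subnets(**request)
        for subnet in response.get("Subnets", []):
            if subnet["AvailabilityZone"] in allowed:
                matches.append(
                    (
                        subnet["SubnetId"],
                        subnet["AvailabilityZone"],
                        subnet["AvailableIpAddressCount"],
                    )
                )
        token = response.get("NextToken")
        if not token:
            break
        # Follow pagination in case the VPC has many subnets.
        request["NextToken"] = token
    # The error message mentions address space, so prefer roomy subnets.
    return sorted(matches, key=lambda item: item[2], reverse=True)


# Usage (needs boto3 and AWS credentials; the VPC ID is a placeholder):
# import boto3
# ec2 = boto3.client("ec2", region_name="eu-central-1")
# print(subnets_in_zones(ec2, "vpc-EXAMPLE", ["eu-central-1b", "eu-central-1c"]))
```

The resulting subnet IDs could then be passed to the training job's VPC configuration (the SageMaker Python SDK `Estimator` accepts `subnets` and `security_group_ids`). Being offered in a zone does not guarantee capacity, so the job may still fail at launch time.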