aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

Extend documentation regarding distributed training for own Docker containers. #218

Open marseller opened 2 months ago

marseller commented 2 months ago

What did you find confusing? Please describe.
I was looking for documentation on distributed training with custom Docker containers. The current documentation explains how to create or extend a container so that it supports distributed training, including a guide to installing the required modules, but it does not explain how to configure the Estimator class or any other launch parameters to start distributed training, as it does for the PyTorch and TensorFlow classes.

Describe how documentation can be improved
Add text that describes how to launch distributed training after creating or extending the Docker image, in these sections:
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api (this link contains a typo that should also be fixed: skd instead of sdk)
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-bring-your-own-container
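
For illustration, the kind of launch example being asked for might look like the sketch below. It only builds the keyword arguments and launches nothing. The `distribution` dictionary is the one the PyTorch and TensorFlow estimators accept; whether the generic `sagemaker.estimator.Estimator` honors it for a custom image is an assumption here, and is exactly what the documentation should confirm. The helper name `build_estimator_kwargs`, the image URI, and the role are hypothetical placeholders.

```python
# Sketch only: collects the arguments one might pass to a generic
# SageMaker Estimator for a custom image. No training job is started.

def build_estimator_kwargs(image_uri, role, instance_count=2,
                           instance_type="ml.p3.16xlarge"):
    """Return keyword arguments with data-parallel training enabled."""
    return {
        "image_uri": image_uri,
        "role": role,
        "instance_count": instance_count,
        "instance_type": instance_type,
        # Same dict the framework estimators take. ASSUMPTION: the generic
        # Estimator treats it the same way for bring-your-own containers.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


kwargs = build_estimator_kwargs(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>",  # placeholder
    role="<execution-role-arn>",  # placeholder
)

# With the SageMaker Python SDK installed, the launch would presumably be:
#   from sagemaker.estimator import Estimator
#   Estimator(**kwargs).fit("s3://<bucket>/<prefix>")
```

If the generic Estimator needs different parameters, or extra environment variables inside the container, that is the information the two sections above should spell out.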