aws / sagemaker-mxnet-training-toolkit

Toolkit for running MXNet training scripts on SageMaker. Dockerfiles used for building SageMaker MXNet Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0
60 stars 55 forks source link

Add mpi support for distributed training in MXNet #178

Closed ChaiBapchya closed 4 years ago

ChaiBapchya commented 4 years ago

Description of changes: Adds MPI support to distributed training

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

ChaiBapchya commented 4 years ago

@ChuyangDeng FYI

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

ChaiBapchya commented 4 years ago

@ChuyangDeng plz help with review. Thanks.

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

ChaiBapchya commented 4 years ago

Integ test would require use of Sagemaker Python SDK for which we have an open PR https://github.com/aws/sagemaker-python-sdk/pull/1581

Can integ test be added later? or is there a way to add integration test without leveraging PythonSDK?

laurenyu commented 4 years ago

Your Python SDK change is simply sending over a hyperparameter, so you should be able to add an integ test here regardless of the Python SDK change. Here's an example with using parameter servers: https://github.com/aws/sagemaker-mxnet-training-toolkit/blob/master/test/integration/sagemaker/test_training.py#L30

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

chuyang-deng commented 4 years ago

LGTM. Skipping the generic horovod test following the practice of tensorflow-training-toolkit: https://github.com/aws/sagemaker-tensorflow-training-toolkit/blob/master/test/integration/local/test_horovod.py#L26

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

ChaiBapchya commented 4 years ago

LGTM. Skipping the generic horovod test following the practice of tensorflow-training-toolkit: https://github.com/aws/sagemaker-tensorflow-training-toolkit/blob/master/test/integration/local/test_horovod.py#L26

Created an issue to track it: https://github.com/aws/sagemaker-mxnet-training-toolkit/issues/180

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

sagemaker-bot commented 4 years ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository