aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

Pass args to training script entrypoint for MPI-based Distributed training #69

Open ChaiBapchya opened 4 years ago

ChaiBapchya commented 4 years ago

Describe the feature you'd like
Pass arguments to the training script while using Horovod via MPI for distributed training.

Current Situation
Only ProcessRunner supports passing hyperparameters: https://github.com/aws/sagemaker-training-toolkit/blob/c357433d6fdbc43a896b25bd126c46f689ddb73c/src/sagemaker_training/process.py#L105-L109

MPIRunner doesn't support it: https://github.com/aws/sagemaker-training-toolkit/blob/c357433d6fdbc43a896b25bd126c46f689ddb73c/src/sagemaker_training/mpi.py#L41-L45
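To make "passing hyperparameters" concrete, here is a minimal sketch of how a runner could flatten a hyperparameter dict into command-line arguments for the training script. The function name hyperparameters_to_args and the rule of skipping keys that start with sagemaker_ are assumptions for illustration only; this is not the toolkit's actual code.

```python
def hyperparameters_to_args(hyperparameters):
    """Flatten a dict into ['--key', 'value', ...] (hypothetical helper)."""
    args = []
    for key, value in sorted(hyperparameters.items()):
        # Assumption: sagemaker_* keys configure the platform and are
        # not forwarded to the user's training script.
        if key.startswith("sagemaker_"):
            continue
        args.extend(["--{}".format(key), str(value)])
    return args


example_args = hyperparameters_to_args(
    {"num-epochs": 5, "batch-size": 64, "sagemaker_mpi_enabled": True}
)
print(example_args)
```

With a conversion like this, the feature request amounts to appending the resulting list to the command that the MPI-based runner launches, just as the process-based runner appends it to its own command.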

How would this feature be used? Please describe.
An example API would be:

mpi_options = '-verbose -x orte_base_help_aggregate=0'
estimator = MXNet(
    entry_point='hvd_resnet_launcher.sh',
    role=role,
    train_instance_type='ml.p3.8xlarge',
    train_instance_count=2,
    image_name=image,
    framework_version='1.6.0',
    py_version='py3',
    hyperparameters={'sagemaker_mpi_enabled': True,
                     'sagemaker_mpi_custom_mpi_options': mpi_options,
                     'sagemaker_mpi_num_of_processes_per_host': 4},
    sagemaker_session=sagemaker_session)

where the entry-point script is hvd_resnet_launcher.sh:

! pygmentize hvd_resnet_launcher.sh
./hvd_resnet_mx.py --num-epochs 5
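For reference, the training script has to accept the --num-epochs flag that the launcher passes. The following is a minimal sketch of that receiving side using argparse; the contents of hvd_resnet_mx.py are not shown in this issue, so the parser below is a hypothetical stand-in, and the default value is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse training flags; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    parser.add_argument("--num-epochs", type=int, default=1)
    # parse_known_args tolerates extra flags that a launcher might append.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--num-epochs", "5"])
print(args.num_epochs)
```

argparse exposes --num-epochs as the attribute num_epochs, so the script reads the value the same way whether it was started by a shell launcher or directly by a runner that forwards arguments.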

Describe alternatives you've considered
Right now, one has to use ProcessRunner instead of MPIRunner to pass a bash script as the training entry point:

estimator = MXNet(
    entry_point='hvd_resnet_launcher.sh',
    role=role,
    train_instance_type='ml.p3.8xlarge',
    train_instance_count=2,
    image_name=image,
    framework_version='1.6.0',
    py_version='py3',
    hyperparameters={'sagemaker_parameter_server_enabled': True},
    sagemaker_session=sagemaker_session)

ChaiBapchya commented 4 years ago

@ChuyangDeng @laurenyu

chuyang-deng commented 4 years ago

Discussed with @ChaiBapchya offline. https://github.com/aws/sagemaker-training-toolkit/blob/master/src/sagemaker_training/mpi.py#L43 might be what he needs.

However, it's difficult to find an example of how args is used.
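As a rough illustration of what forwarding args could look like, the sketch below assembles an mpirun command line and appends the script arguments at the end. It does not use the toolkit's MPIRunner; build_mpi_command is a hypothetical helper, the host names algo-1 and algo-2 are assumed, and only the standard Open MPI options --host and -np are relied on.

```python
import shlex


def build_mpi_command(hosts, processes_per_host, custom_mpi_options,
                      script, script_args):
    """Compose an mpirun command list (hypothetical helper, not toolkit code)."""
    # Open MPI accepts host:slots pairs in a comma-separated --host list.
    host_list = ",".join("{}:{}".format(h, processes_per_host) for h in hosts)
    total_processes = len(hosts) * processes_per_host
    command = ["mpirun", "--host", host_list, "-np", str(total_processes)]
    # User-supplied MPI options are split the way a shell would split them.
    command += shlex.split(custom_mpi_options)
    # The script and its own arguments come last, after all MPI options.
    command += [script] + list(script_args)
    return command


command = build_mpi_command(
    hosts=["algo-1", "algo-2"],  # assumed host names
    processes_per_host=4,
    custom_mpi_options="-verbose -x orte_base_help_aggregate=0",
    script="./hvd_resnet_mx.py",
    script_args=["--num-epochs", "5"],
)
print(" ".join(command))
```

The ordering is the point of the sketch: everything after the script path is handed to the script untouched, so no launcher shell script is needed to supply --num-epochs.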

ChaiBapchya commented 4 years ago

The documentation for this is missing. Can someone help with adding it? Maybe the on-call?