Description of changes:
In the current MPI setup, every worker node waits at most 5 minutes for the master node to start the ORTE process. When one worker node takes more than 5 minutes to start, the master node waits for it, but the other worker nodes time out and error out because no ORTE process is found, and the whole training job fails.
In practice, large clusters fail for this reason because some nodes take more than 5 minutes to start.
This PR raises the waiting time from 5 minutes to 20 minutes, as suggested by the SageMaker platform team.
Also, a newer version of the SageMaker Python SDK is blocking our pipeline. As a workaround, this PR pins it to an earlier version. We will revisit this once the Python SDK team has a fix.
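For reviewers unfamiliar with the wait logic, the worker-side behaviour described above can be sketched as a poll-with-deadline loop. This is a minimal illustration, not the toolkit's actual code: `wait_until` and `ORTED_WAIT_TIMEOUT_SECONDS` are hypothetical names, and the real check for the ORTE process is left as a caller-supplied function.

```python
import time
from typing import Callable

# Hypothetical constant: the 20-minute limit this PR moves to (previously 5 * 60).
ORTED_WAIT_TIMEOUT_SECONDS = 20 * 60


def wait_until(condition: Callable[[], bool], timeout: float, interval: float = 1.0) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds have elapsed.

    Returns True if the condition was met in time, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

A worker would call something like `wait_until(orted_is_running, ORTED_WAIT_TIMEOUT_SECONDS)`, where `orted_is_running` is an assumed helper that reports whether the ORTE process has appeared, and raise an error if the call returns `False`. A longer timeout only delays the failure case; nodes that start quickly still proceed as soon as the condition holds.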
Testing done:
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
Issue #, if available:
General
Tests
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.