aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0
496 stars 118 forks

fix: increase worker waiting time for ORTE proc #178

Closed yl-to closed 1 year ago

yl-to commented 1 year ago

Issue #, if available:

Description of changes: In the current MPI setup, every worker node waits at most 5 minutes for the master node to start the ORTE process. When one worker node takes more than 5 minutes to start, the master node waits for it, but the other worker nodes start to error out because no ORTE process is found, and the whole training job then fails. In practice, large clusters have been failing for this reason because some nodes take more than 5 minutes to start. This PR raises the waiting time from 5 minutes to 20 minutes, as suggested by the SageMaker platform team.
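
For illustration only, the waiting behavior described above can be sketched as a polling loop with a deadline. This is not the toolkit's actual code: the function name `wait_for_orte`, the `find_orte` probe, the two constants, and the injectable `clock`/`sleep` parameters are all assumptions made for this sketch. Only the 5 minute and 20 minute values come from the description.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical constants; the PR description only states "5 min" and "20 min".
OLD_TIMEOUT_SECONDS = 5 * 60
NEW_TIMEOUT_SECONDS = 20 * 60


def wait_for_orte(
    find_orte: Callable[[], Optional[T]],
    timeout_seconds: float = NEW_TIMEOUT_SECONDS,
    poll_interval_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Poll `find_orte` until it returns a non-None value or the timeout expires.

    `find_orte` is a caller-supplied probe (hypothetical) that returns a handle
    to the ORTE process if one is running on this node, else None.
    """
    deadline = clock() + timeout_seconds
    while True:
        found = find_orte()
        if found is not None:
            return found
        if clock() >= deadline:
            # A worker reaching this point is what fails the whole training job.
            raise TimeoutError(
                f"no ORTE process found within {timeout_seconds} seconds"
            )
        sleep(poll_interval_seconds)
```

Under these assumptions, a worker whose ORTE process appears 8 minutes after it starts polling would raise `TimeoutError` with `OLD_TIMEOUT_SECONDS` but return normally with `NEW_TIMEOUT_SECONDS`. Passing `clock` and `sleep` as parameters is a design choice for the sketch so the loop can be exercised with a fake clock instead of real waiting.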

Also, a newer SageMaker Python SDK version is blocking our pipeline. As a workaround, this PR pins the SDK to an earlier version. We will revisit this once the Python SDK team has a fix.

Testing done:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

yl-to commented 1 year ago

Add more details in the description.

Updated.