aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0
2.09k stars, 1.13k forks

Is cluster downsizing possible in distributed TrainingJobs? #1067

Closed jharrang closed 4 years ago

jharrang commented 4 years ago

Reference: 0413038650

Background:

My group is developing a BYOC (bring-your-own-container) Algorithm Resource. The final steps of our TrainingJob workflow require only one instance, but earlier steps will run distributed across several instances.

Question:

Is it possible to terminate some instances in a SageMaker TrainingJob cluster while other instances continue running? For example, if we run a TrainingJob with 10 instances, but the entrypoint scripts on 9 of those instances call sys.exit(0) while the single remaining instance continues to do work, will SageMaker:

 A. Stop billing for the 9 instances when they exit?
 B. Stop billing for the 9 instances when the TrainingJob completes
    (i.e. when the remaining instance exits)?
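
The scenario in the question can be sketched roughly as follows. This is only an illustration, assuming the container reads the `/opt/ml/input/config/resourceconfig.json` file that SageMaker writes into training containers (with its `current_host` and `hosts` fields). The helper names `is_primary_host` and `main`, and the choice of the alphabetically first host as the one that keeps working, are hypothetical.

```python
import json
import sys

# Location of the cluster description inside a SageMaker training container.
RESOURCE_CONFIG_PATH = "/opt/ml/input/config/resourceconfig.json"


def is_primary_host(resource_config: dict) -> bool:
    """Return True if this host should run the single-instance final steps.

    Assumption for this sketch: the alphabetically first host name is
    treated as the primary host.
    """
    return resource_config["current_host"] == sorted(resource_config["hosts"])[0]


def main() -> None:
    with open(RESOURCE_CONFIG_PATH) as f:
        resource_config = json.load(f)

    # ... distributed steps would run on every host here ...

    if not is_primary_host(resource_config):
        # The other hosts exit early, as described in the question.
        sys.exit(0)

    # ... single-instance final steps would run here ...
```

The entrypoint script would call `main()`. Whether exiting early on the other hosts changes what is billed is exactly the question being asked.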

ChoiByungWook commented 4 years ago

Hello @jharrang,

Sorry for the late response.

Let me reach out to the corresponding team who handles the training platform for SageMaker and get back to you.

Reference: 0413038650

Thank you for your patience.

ishaaq commented 4 years ago

Hello @jharrang, SageMaker training jobs are billed from the training start time until completion, which is when the model has been uploaded to S3.

So the correct answer to your question is (B): the billing clock stops only after the last instance exits and SageMaker has had a chance to finish any outstanding cleanup work, such as uploading models and logs.
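
To make the difference between (A) and (B) concrete, here is a small sketch of the arithmetic. It assumes a job description shaped like the response of the `DescribeTrainingJob` API, using its `BillableTimeInSeconds` and `ResourceConfig.InstanceCount` fields. The function names and the sample numbers are made up for illustration and are not taken from a real job.

```python
def billed_instance_seconds(description: dict) -> int:
    """Total instance-seconds billed under (B).

    Assumption: the wall-clock billable time applies to every instance
    in the cluster for the whole job.
    """
    instance_count = description["ResourceConfig"]["InstanceCount"]
    return description["BillableTimeInSeconds"] * instance_count


def early_exit_instance_seconds(
    instance_count: int, distributed_seconds: int, total_seconds: int
) -> int:
    """Instance-seconds that (A) would imply if it applied (it does not).

    All instances run for the distributed phase, then a single instance
    runs for the remaining time.
    """
    return instance_count * distributed_seconds + (total_seconds - distributed_seconds)


# Hypothetical job: 10 instances, 1 hour in total, the first 10 minutes distributed.
sample = {"BillableTimeInSeconds": 3600, "ResourceConfig": {"InstanceCount": 10}}

under_b = billed_instance_seconds(sample)
under_a = early_exit_instance_seconds(10, 600, 3600)
```

With these made-up numbers, (B) comes to 36000 instance-seconds, whereas (A) would have come to 9000. If most of the runtime needs only one instance, it may be worth considering a separate single-instance job for the final steps.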

laurenyu commented 4 years ago

Closing due to inactivity. Feel free to reopen if necessary.