aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

PyTorch Estimator `max_run` parameter not working at all #4451

Open yiphei opened 8 months ago

yiphei commented 8 months ago

Describe the bug I tried to use the `max_run` parameter of `sagemaker.pytorch.estimator.PyTorch` to set the maximum run time in seconds, but it doesn't work. See the attached screenshot for an example: I set `max_run` to 603 seconds, but the job did not stop at 603 seconds. The training time had reached 841 s when I terminated the run manually.

Screenshot 2024-02-23 at 6 47 13 PM

To reproduce Set `max_run` of `sagemaker.pytorch.estimator.PyTorch` to any integer value.
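
A minimal reproduction might look like the sketch below. Only `max_run=603` comes from this report; the entry point, role ARN, instance type, and framework versions are placeholder assumptions, and the helper names `build_estimator_kwargs` and `make_estimator` are hypothetical.

```python
def build_estimator_kwargs(max_run_seconds):
    """Collect PyTorch estimator arguments, including the max_run limit."""
    if not isinstance(max_run_seconds, int) or max_run_seconds <= 0:
        raise ValueError("max_run must be a positive integer number of seconds")
    return {
        # Placeholder values (assumptions, not taken from the report).
        "entry_point": "train.py",
        "role": "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
        "instance_count": 1,
        "instance_type": "ml.m5.xlarge",
        "framework_version": "2.1",
        "py_version": "py310",
        # The value used in the report.
        "max_run": max_run_seconds,
    }


def make_estimator(kwargs):
    """Build the estimator; needs the sagemaker package and AWS credentials."""
    from sagemaker.pytorch import PyTorch  # imported lazily on purpose

    return PyTorch(**kwargs)


kwargs = build_estimator_kwargs(603)
# With AWS access configured, the job would be started like this:
# estimator = make_estimator(kwargs)
# estimator.fit()
```

The import is deferred so the argument dictionary can be inspected without the SDK installed; starting the job itself requires a real role and training script.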

Expected behavior I expect the SageMaker training run to terminate once the number of seconds set in `max_run` has elapsed.
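
One way to check this expectation is to compare the limit recorded on the job with the time it actually trained. The sketch below assumes the job description is the dictionary returned by the boto3 SageMaker client's `describe_training_job` call, and reads its `StoppingCondition.MaxRuntimeInSeconds`, `TrainingTimeInSeconds`, and `TrainingStartTime` fields. The helper name `runtime_report` and the sample values are illustrative; it only reports the numbers and does not explain why a job overran.

```python
from datetime import datetime, timezone


def runtime_report(description, now=None):
    """Compare a training job's configured limit with its elapsed training time."""
    limit = description["StoppingCondition"]["MaxRuntimeInSeconds"]
    elapsed = description.get("TrainingTimeInSeconds")
    if elapsed is None:
        # Fall back to wall-clock time since training started, if known.
        start = description.get("TrainingStartTime")
        if start is None:
            return {"limit": limit, "elapsed": None, "over_by": None}
        now = now or datetime.now(timezone.utc)
        elapsed = int((now - start).total_seconds())
    return {"limit": limit, "elapsed": elapsed, "over_by": max(0, elapsed - limit)}


# Hand-written sample using the numbers from this report (not real API output).
sample = {
    "StoppingCondition": {"MaxRuntimeInSeconds": 603},
    "TrainingTimeInSeconds": 841,
}
report = runtime_report(sample)

# Against a real job (requires boto3 and AWS credentials):
# import boto3
# desc = boto3.client("sagemaker").describe_training_job(TrainingJobName="my-job")
# print(runtime_report(desc))
```

If `limit` in the report does not match the value passed as `max_run`, the setting never reached the job; if it matches and `over_by` is large, the limit was recorded but not enforced at the expected time.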

Screenshots or logs See screenshot in description

System information A description of your system. Please provide:

Additional context NA

sarseniy commented 3 weeks ago

+1 Can anybody explain how this works?