Open jonathanng opened 4 months ago
This is the solution I came up with so far:
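```python
import time

import autogluon.cloud
import botocore.exceptions

# `predictor`, `predictor_init_args`, `predictor_fit_args` and `instance_type`
# are defined earlier in my script.
try:
    predictor.fit(
        predictor_init_args=predictor_init_args,
        predictor_fit_args=predictor_fit_args,
        instance_type=instance_type,
        wait=True,
    )
except botocore.exceptions.ClientError:
    # fit() raised while waiting; look up the job name so we can re-attach to it.
    info = predictor.info()
    job_name = info['fit_job']['name']
    while True:
        try:
            # Re-attach to the existing fit job with a fresh predictor.
            predictor = autogluon.cloud.TabularCloudPredictor()
            predictor.attach_job(job_name=job_name)
            job_status = predictor.get_fit_job_status()
        except botocore.exceptions.ClientError:
            # _get_session() is my own helper (not shown) that creates a new session.
            session = _get_session()
        else:
            if job_status in ('InProgress', 'Stopping'):
                time.sleep(60 * 5)  # poll every 5 minutes
            elif job_status == 'Completed':
                break
            elif job_status in ('Failed', 'Stopped', 'NotCreated'):
                raise ValueError(f'Job {job_name} {job_status}')
```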
I want to run `TabularCloudPredictor` with `wait=True`, but I am getting an error when the role session expires. I was hoping I could just extend the role duration with `DurationSeconds`. But according to this:
https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html
the duration cannot exceed the role's maximum session duration, which is 12 hours.
My training jobs usually take longer than 12 hours.
Has anyone found an elegant solution for this?
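
One way to tidy up the workaround above is to pull the polling loop into a small helper, so the retry logic can be tested without touching AWS. This is only a minimal sketch: `wait_for_job`, `get_status`, `refresh`, `retry_on` and `sleep` are hypothetical names introduced here, not part of AutoGluon or botocore. The status strings are the ones used in the snippet above.

```python
import time
from typing import Callable, Type

# Status strings taken from the workaround above.
SUCCESS_STATUSES = {"Completed"}
FAILURE_STATUSES = {"Failed", "Stopped", "NotCreated"}


def wait_for_job(
    get_status: Callable[[], str],
    refresh: Callable[[], None],
    retry_on: Type[BaseException] = Exception,
    poll_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the job reaches a terminal status.

    get_status and refresh are hypothetical callables supplied by the caller:
    get_status returns the current job status string, and refresh rebuilds
    whatever session or credentials get_status depends on. Errors of type
    retry_on trigger a refresh followed by an immediate retry.
    """
    while True:
        try:
            status = get_status()
        except retry_on:
            refresh()
            continue
        if status in SUCCESS_STATUSES:
            return status
        if status in FAILURE_STATUSES:
            raise ValueError(f"Job ended with status {status}")
        # Any other status (e.g. 'InProgress', 'Stopping') means keep waiting.
        sleep(poll_seconds)
```

In the setting of this issue, `get_status` would wrap the `attach_job` / `get_fit_job_status` calls, `refresh` would wrap the session-refresh helper, and `retry_on` would be `botocore.exceptions.ClientError`. Note that this only covers waiting for the job; it does not extend the 12-hour session limit itself.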