aws / sagemaker-python-sdk

A library for training and deploying machine learning models on Amazon SageMaker
https://sagemaker.readthedocs.io/
Apache License 2.0

sagemaker job failing in transformation step - Factorization Machines #962

Closed a-torrano-m closed 3 years ago

a-torrano-m commented 5 years ago

Reference: SMAlgo-314


System Information

Describe the problem

We want to produce recommendations with SageMaker using Factorization Machines. We feed the model a sparse matrix of 45000 rows and 15000 columns. Training completes successfully, but the batch transform job fails while we are in `wait()`. The exception points us to the logs, where the message is: "Unable to get response from algorithm."
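The report does not show the input file, so the following is only a sketch for sanity-checking one. It assumes each line of the batch input is a JSON object in a sparse `keys`/`shape`/`values` layout, and that no single line should exceed the `MaxPayloadInMB=6` value shown in the job log. The helper names `sparse_row_to_json_line` and `oversized_lines` are hypothetical, and nothing in this thread confirms that payload size is the cause of the failure.

```python
import json

# Value taken from the job log in this report; adjust it to your own job.
MAX_PAYLOAD_MB = 6


def sparse_row_to_json_line(keys, values, feature_dim):
    """Serialize one sparse row as a single-line JSON record.

    The record layout is an assumption about the input format; the
    original report does not show what each line of the file contains.
    """
    record = {
        "data": {
            "features": {
                "keys": list(keys),        # indices of the non-zero columns
                "shape": [feature_dim],    # total number of columns
                "values": list(values),    # the non-zero values themselves
            }
        }
    }
    return json.dumps(record, separators=(",", ":"))


def oversized_lines(lines, max_payload_mb=MAX_PAYLOAD_MB):
    """Return the indices of lines whose UTF-8 size exceeds the limit.

    One extra byte per line accounts for the newline separator.
    Treating 1 MB as 1024 * 1024 bytes is an assumption.
    """
    limit = max_payload_mb * 1024 * 1024
    return [
        i for i, line in enumerate(lines)
        if len(line.encode("utf-8")) + 1 > limit
    ]
```

Running `oversized_lines` over the lines of the input file before starting the transform job would at least rule out a single record that is too large to send.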

Minimal repro / logs


EXCEPTION OUTPUT:

    ValueError                                Traceback (most recent call last)
    in ()
         13 print(datetime.datetime.now().time())
         14
    ---> 15 fmTr.wait()
         16 print(datetime.datetime.now().time())

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
        205     def wait(self):
        206         self._ensure_last_transform_job()
    --> 207         self.latest_transform_job.wait()
        208
        209     def _ensure_last_transform_job(self):

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
        304
        305     def wait(self):
    --> 306         self.sagemaker_session.wait_for_transform_job(self.job_name)
        307
        308     @staticmethod

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in wait_for_transform_job(self, job, poll)
       1004         """
       1005         desc = _wait_until(lambda: _transform_job_status(self.sagemaker_client, job), poll)
    -> 1006         self._check_job_status(job, desc, "TransformJobStatus")
       1007         return desc
       1008

    ~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
       1026             reason = desc.get("FailureReason", "(No reason provided)")
       1027             job_type = status_key_name.replace("JobStatus", " job")
    -> 1028             raise ValueError("Error for {} {}: {} Reason: {}".format(job_type, job, status, reason))
       1029
       1030     def wait_for_endpoint(self, endpoint, poll=5):

    ValueError: Error for Transform job factorization-machines-2019-08-01-09-40-45-581: Failed Reason: InternalServerError: We encountered an internal error. Please try again.

LOG MESSAGE:

    2019-08-01T09:44:02.787:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
    2019-08-01T09:45:48.275:[sagemaker logs]: (...bucket and key...)/BATCH_jobName.csv000.json: Unable to get response from algorithm

- **Exact command to reproduce**:

```python
fmTr = fm.transformer(
    instance_count=1,
    instance_type='ml.c4.xlarge',  # 'ml.m4.xlarge',
    strategy='MultiRecord',
    assemble_with='Line',
    output_path='s3://' + bucket + '/' + outputPath)

fmTr.transform(batch_input_s3, content_type='application/json', split_type='Line')
fmTr.wait()
```

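The traceback shows that `wait()` raises as soon as the job status is `Failed`, which leaves only the exception text to go on. As a rough sketch, the same fields can be read from the job description without raising. The function name `summarize_transform_job` is hypothetical; it assumes `desc` is the dictionary returned by the SageMaker DescribeTransformJob API, for example via `boto3.client("sagemaker").describe_transform_job(TransformJobName=name)`. The example dictionary is hand-written from the values in this report, not real API output.

```python
def summarize_transform_job(desc):
    """Build a short text summary from a transform job description dict."""
    status = desc.get("TransformJobStatus", "Unknown")
    # Same fallback text the SDK uses in _check_job_status (see traceback).
    reason = desc.get("FailureReason", "(No reason provided)")

    lines = [
        "job: {}".format(desc.get("TransformJobName", "(unknown)")),
        "status: {}".format(status),
    ]
    if status == "Failed":
        lines.append("reason: {}".format(reason))

    # Batching settings are worth printing next to the failure,
    # since they determine how records are grouped into requests.
    for key in ("BatchStrategy", "MaxPayloadInMB", "MaxConcurrentTransforms"):
        if key in desc:
            lines.append("{}: {}".format(key, desc[key]))
    return "\n".join(lines)


# Hand-written example mirroring the values reported above (not real output).
example = {
    "TransformJobName": "factorization-machines-2019-08-01-09-40-45-581",
    "TransformJobStatus": "Failed",
    "FailureReason": "InternalServerError: We encountered an internal error. Please try again.",
    "BatchStrategy": "MultiRecord",
    "MaxPayloadInMB": 6,
    "MaxConcurrentTransforms": 4,
}
print(summarize_transform_job(example))
```

The failure reason here is only the generic `InternalServerError`; the more specific "Unable to get response from algorithm" message appears only in the job logs.
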
yastasho commented 5 years ago

Hi a-torrano-m, could you please share training job hyperparameters as well?

a-torrano-m commented 5 years ago

Hi yastasho, here they are:

    fm.set_hyperparameters(feature_dim=numFeatures,
                           predictor_type='binary_classifier',
                           mini_batch_size=1000,
                           num_factors=64,
                           epochs=100)

a-torrano-m commented 5 years ago

Could these hyperparameters be tested? Is the error reproducible?

Thanks

a-torrano-m commented 5 years ago

Has a cause been found for the issue?

Thanks

yastasho commented 5 years ago

Hi a-torrano-m, the error is reproducible. We will work on a fix. Thanks for reporting the issue.

a-torrano-m commented 5 years ago

Thanks, yastasho! Have you opened a Jira ticket or an issue ID that we could follow to track progress? Otherwise, we will wait for news in this thread. Thanks very much!