aws-samples / amazon-sagemaker-tensorflow-object-detection-api

Train and deploy models using TensorFlow 2 with the Object Detection API on Amazon SageMaker
MIT No Attribution

Error while creating TFRecord in prepare_data.ipynb #16

Closed Abhishek-08 closed 2 years ago

Abhishek-08 commented 3 years ago

I am trying to run the sample notebook to prepare the bees dataset. Every cell up to the TFRecord conversion step ran without issues, but when I run the conversion cell I get the following error:

UnexpectedStatusException Traceback (most recent call last)

in
     22 output_name='tfrecords',
     23 source=output_folder,
     24 destination=f's3://{bucket}/data/bees/tfrecords'
     25 )
     26 ]

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/processing.py in run(self, inputs, outputs, arguments, wait, logs, job_name, experiment_config, kms_key)
    170 self.jobs.append(self.latest_job)
    171 if wait:
    172     self.latest_job.wait(logs=logs)
    173
    174 def _extend_processing_args(self, inputs, outputs, **kwargs): # pylint: disable=W0613

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/processing.py in wait(self, logs)
    854 """
    855 if logs:
    856     self.sagemaker_session.logs_for_processing_job(self.job_name, wait=True)
    857 else:
    858     self.sagemaker_session.wait_for_processing_job(self.job_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in logs_for_processing_job(self, job_name, wait, poll)
   3453
   3454 if wait:
   3455     self._check_job_status(job_name, description, "ProcessingJobStatus")
   3456     if dot:
   3457         print()

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   2955 ),
   2956 allowed_statuses=["Completed", "Stopped"],
   2957 actual_status=status,
   2958 )
   2959

UnexpectedStatusException: Error for Processing job tf2-object-detection-2021-07-29-09-51-30-040: Failed. Reason: ClientError: API error (404): manifest for 222599357734.dkr.ecr.us-east-1.amazonaws.com/tfrecord-processing:20210729094117 not found: manifest unknown: Requested image not found

timjell commented 2 years ago

The problem occurs earlier in the notebook, in the step that builds the Docker image and pushes it to the ECR registry. The default IAM role, which is called something like AmazonSageMaker-ExecutionRole-xxxxxxx (with some numbers at the end), does not have 'Write' permissions on ECR, specifically the 'InitiateLayerUpload' permission. The push therefore fails, and the processing job later cannot find the image.
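One way to confirm that the push never landed is to ask ECR directly whether the tag from the error message exists. The following is only a sketch: `parse_ecr_uri` and `image_exists` are hypothetical helper names, not part of the notebook, and the check assumes `boto3` is installed and that the calling role is allowed to perform `ecr:DescribeImages`.

```python
import re

# Pattern for a tagged ECR image URI such as
# <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repo>[^:@]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri):
    """Split a tagged ECR image URI into account, region, repo and tag."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not a tagged ECR image URI: {uri}")
    return match.groupdict()


def image_exists(uri):
    """Return True if the tagged image is present in ECR (hypothetical helper)."""
    # Imported here so that parse_ecr_uri stays usable without boto3.
    import boto3
    from botocore.exceptions import ClientError

    parts = parse_ecr_uri(uri)
    ecr = boto3.client("ecr", region_name=parts["region"])
    try:
        ecr.describe_images(
            registryId=parts["account"],
            repositoryName=parts["repo"],
            imageIds=[{"imageTag": parts["tag"]}],
        )
    except ClientError as err:
        code = err.response["Error"]["Code"]
        if code in ("ImageNotFoundException", "RepositoryNotFoundException"):
            return False
        raise  # e.g. AccessDenied: surface it rather than guess
    return True
```

Calling `image_exists(...)` with the image URI from the traceback before starting the processing job would show whether the failed push is the cause.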

If you check CloudWatch, you will see the failed attempts. Selecting one of the events shows a message like this:

"errorMessage": "User: arn:aws:sts:::assumed-role/AmazonSageMaker-ExecutionRole-/SageMaker is not authorized to perform: ecr:InitiateLayerUpload on resource: arn:aws:ecr:ap-southeast-2::repository/tfrecord-processing because no identity-based policy allows the ecr:InitiateLayerUpload action"

I created a new policy with all the ECR 'Write' permissions and attached it to the AmazonSageMaker-ExecutionRole. If you then rerun the whole notebook, the docker push to ECR should complete.
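For anyone who prefers to script this instead of using the console, here is a minimal sketch with `boto3`. It is narrower than the "all Write permissions" policy described above: it grants only the actions normally needed to push an image to a single repository. The helper names, the policy name, and the account ID in the usage note are placeholders, not values from this thread, and the code assumes it runs with credentials that are allowed to edit the role. If the build step also creates the repository, `ecr:CreateRepository` would probably be needed as well.

```python
import json

# Repository-scoped actions typically required for `docker push` to ECR.
ECR_PUSH_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:InitiateLayerUpload",
    "ecr:UploadLayerPart",
    "ecr:CompleteLayerUpload",
    "ecr:PutImage",
]


def build_ecr_push_policy(region, account_id, repository):
    """Build an IAM policy document that allows pushing to one ECR repository."""
    repo_arn = f"arn:aws:ecr:{region}:{account_id}:repository/{repository}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # The login token is not tied to a repository, so it needs "*".
            {
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": ECR_PUSH_ACTIONS,
                "Resource": repo_arn,
            },
        ],
    }


def attach_policy(role_name, policy, policy_name="ecr-push-tfrecord-processing"):
    """Attach the document to the role as an inline policy (hypothetical helper)."""
    import boto3  # imported here so the builder above works without boto3

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,  # placeholder name, choose your own
        PolicyDocument=json.dumps(policy),
    )
```

Usage would look like `attach_policy("AmazonSageMaker-ExecutionRole-xxxxxxx", build_ecr_push_policy("ap-southeast-2", "123456789012", "tfrecord-processing"))`, with the role name, region, and account ID replaced by your own.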

sofianhamiti commented 2 years ago

Hi @Abhishek-08, I have updated the repo with a new TF version. Can you give it a try and confirm whether it works now?