aws-samples / amazon-sagemaker-tensorflow-object-detection-api

Train and deploy models using TensorFlow 2 with the Object Detection API on Amazon SageMaker
MIT No Attribution

Question: how/when is source_dir copied into the SageMaker training instance? #18

Closed mkabatek closed 2 years ago

mkabatek commented 2 years ago

Hello,

I was able to get this example working without any problems, and I can adapt it to my purpose. However, I am now attempting to use the AWS SageMaker JavaScript SDK to accomplish the same task. I don't quite understand how source_dir in the Jupyter notebook instance gets transferred to the SageMaker training instance.

Is this done by the SageMaker Python SDK? Can someone comment on how this could be done via the SageMaker JavaScript SDK?

The following successfully launches a training instance. The image is downloaded from ECR, but the training fails. I suspect this is because source_dir has not been copied over to the SageMaker training instance.

        // 'HH' is the 24-hour clock; 'hh' (12-hour) could yield the same job name twice in one day
        const trainingDateTime = moment().utc().format('YYYY-MM-DD-HH-mm-ss-SSS')
        let roleArn = 'arn:aws:iam::XXXX:role/sagemaker_role_dev'
        let TrainingJobName = `tf2-object-detection-${trainingDateTime}`
        let TrainingImage   = 'XXXX.dkr.ecr.us-west-2.amazonaws.com/tf-object-detection:XXXX'
        let S3Uri           = 's3://tf-training-test/data/antenna/tfrecords/'

        let params = {
            AlgorithmSpecification: { /* required */
                TrainingInputMode: 'File', /* required */
                TrainingImage: TrainingImage
            },
            OutputDataConfig: { /* required */
                S3OutputPath: `s3://sagemaker-us-west-2-XXXX/`, /* required */
            },
            ResourceConfig: { /* required */
                InstanceCount: 1, /* required */
                InstanceType: 'ml.p3.2xlarge', /* required */
                VolumeSizeInGB: 30, /* required */
            },
            RoleArn: roleArn, /* required */
            StoppingCondition: { /* required */
                MaxRuntimeInSeconds: 86400
            },
            TrainingJobName: TrainingJobName, /* required */
            InputDataConfig: [
                {
                    ChannelName: 'train', /* required */
                    DataSource: { /* required */
                        S3DataSource: {
                            S3DataType: 'S3Prefix', /* required */
                            S3Uri: S3Uri, /* required */
                            S3DataDistributionType: 'FullyReplicated'
                        }
                    },
                    CompressionType: null,
                    ContentType: '',
                    RecordWrapperType: null,
                }
            ],
            HyperParameters: {
                'model_dir':                        '"/opt/training"',
                'num_train_steps':                  '500',
                'pipeline_config_path':             '"pipeline.config"',
                'sagemaker_container_log_level':    '20',
                'sagemaker_job_name':               `tf2-object-detection-${trainingDateTime}`,
                'sagemaker_program':                '"run_training.sh"',
                'sagemaker_region':                 '"us-west-2"',
                'sagemaker_submit_directory':       `"s3://sagemaker-us-west-2-XXXX/${TrainingJobName}/source/sourcedir.tar.gz"`,
                'sample_1_of_n_eval_examples':      '1',
            },
            TensorBoardOutputConfig: {
                S3OutputPath: 's3://tf-training-test/data/antenna/tensorboard/', /* required */
                LocalPath: '/opt/training/'
            }
        };

        return this.sagemaker.createTrainingJob(params).promise();

SageMaker training fails with the following error:

2021-10-08 18:30:19,173 sagemaker-training-toolkit ERROR    framework error: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_training/trainer.py", line 97, in train
    runner_type=runner_type,
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_training/entry_point.py", line 92, in run
    files.download_and_extract(uri=uri, path=environment.code_dir)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_training/files.py", line 131, in download_and_extract
    s3_download(uri, dst)
  File "/usr/local/lib/python3.6/dist-packages/sagemaker_training/files.py", line 167, in s3_download
    s3.Bucket(bucket).download_file(key, dst)
  File "/usr/local/lib/python3.6/dist-packages/boto3/s3/inject.py", line 247, in bucket_download_file
    ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
  File "/usr/local/lib/python3.6/dist-packages/boto3/s3/inject.py", line 173, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python3.6/dist-packages/boto3/s3/transfer.py", line 307, in download_file
    future.result()
  File "/usr/local/lib/python3.6/dist-packages/s3transfer/futures.py", line 106, in result
    return self._coordinator.result()
  File "/usr/local/lib/python3.6/dist-packages/s3transfer/futures.py", line 265, in result
    raise self._exception
  File "/usr/local/lib/python3.6/dist-packages/s3transfer/tasks.py", line 255, in _main
    self._submit(transfer_future=transfer_future, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/s3transfer/download.py", line 343, in _submit
    **transfer_future.meta.call_args.extra_args
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 386, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 705, in _make_api_call
    raise error_class(parsed_response, operation_name)

botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

Any insight into what causes this error, or into how to get this example working with the SageMaker JavaScript SDK, would be greatly appreciated.

sofianhamiti commented 2 years ago

Hi @mkabatek, this is more of a SageMaker Python SDK question, and I suggest you ask there, on AWS re:Post, or on Stack Overflow.
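
For anyone landing here with the same question: the traceback above shows the training toolkit calling download_and_extract on an S3 URI and getting a 404 from HeadObject, which is consistent with the object named in sagemaker_submit_directory not existing. With the SageMaker Python SDK, the estimator packages source_dir into sourcedir.tar.gz and uploads it to S3 before it creates the job; createTrainingJob on its own does not upload anything. Below is a minimal sketch of doing that step by hand. It assumes the AWS SDK for JavaScript v2 (as in the post), a `tar` binary on PATH, and that the archive is written outside source_dir. All four helper names are hypothetical and not part of any SDK.

```javascript
const { execFileSync } = require('child_process');
const fs = require('fs');

// Hypothetical helper: S3 location for the code archive, following the key
// layout already used in the post's sagemaker_submit_directory value.
function sourceArchiveLocation(bucket, trainingJobName) {
  const key = `${trainingJobName}/source/sourcedir.tar.gz`;
  return { bucket, key, uri: `s3://${bucket}/${key}` };
}

// Hypothetical helper: pack the contents of sourceDir at the archive root.
// Assumes `tar` is installed and archivePath lies outside sourceDir.
function packSourceDir(sourceDir, archivePath) {
  execFileSync('tar', ['-czf', archivePath, '-C', sourceDir, '.']);
  return archivePath;
}

// Hypothetical helper: upload the archive. `s3` is assumed to be an
// AWS SDK v2 S3 client, e.g. new AWS.S3({ region: 'us-west-2' }).
async function uploadSourceArchive(s3, archivePath, location) {
  await s3.upload({
    Bucket: location.bucket,
    Key: location.key,
    Body: fs.createReadStream(archivePath),
  }).promise();
  return location.uri;
}

// Hypothetical helper: the hyperparameter values in the post are JSON-encoded
// strings, so the URI is encoded the same way.
function withSubmitDirectory(hyperParameters, uri) {
  return { ...hyperParameters, sagemaker_submit_directory: JSON.stringify(uri) };
}
```

Calling packSourceDir, then uploadSourceArchive, and then setting params.HyperParameters = withSubmitDirectory(params.HyperParameters, uri) before createTrainingJob should put the archive where the container looks for it. Whether the job then succeeds still depends on the archive containing the run_training.sh and pipeline.config files that the other hyperparameters refer to.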