aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Bug Report] Unable to utilize multiple instances in sagemaker batch transform request #3134

Closed: grantdelozier closed this issue 2 years ago

grantdelozier commented 2 years ago

Describe the bug

Throughout the SageMaker batch transform documentation it is suggested that multiple instances can be used to fulfill inference requests. The API documentation for CreateTransformJob accepts a parameter called InstanceCount.

However, whenever I create transform jobs that request more than one instance, only one instance is actually used to fulfill inferences. I can see from the logs that multiple instances are started, but only one of them serves requests.

It looks as though someone else noticed this previously, but the issue was closed without being resolved. In that thread @djarpin suggests that multiple instances will be used if multiple input files are provided. However, this doesn't seem to work either: if you point your TransformInput at a folder containing multiple files, both files are processed, but all invocations are still sent to a single instance.

To reproduce

Invoke a createTransformJob() request with

TransformResources: {
        InstanceCount: instanceCount,
        InstanceType: model.defaultInstanceType
      },

where instanceCount > 1. In CloudWatch, observe that all invocations are sent to a single instance while the other instances sit idle.
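One rough way to confirm which instances are receiving traffic is to list the job's CloudWatch log streams. This is only a sketch (Node.js, AWS SDK v2, credentials and region taken from the environment); the log group name below is the one SageMaker typically uses for transform jobs, and the job name is a placeholder.

const AWS = require('aws-sdk');
const cwLogs = new AWS.CloudWatchLogs();

// Transform containers usually log to this group, with one or more streams per
// instance. If only one stream shows activity, only one instance is serving
// invocations.
async function listTransformLogStreams(jobName) {
  const resp = await cwLogs.describeLogStreams({
    logGroupName: '/aws/sagemaker/TransformJobs',
    logStreamNamePrefix: jobName,
  }).promise();
  return resp.logStreams.map(s => s.logStreamName);
}

listTransformLogStreams('my-transform-job').then(console.log);  // hypothetical job name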

Here is the full list of parameters I include in my createTransformJob() request:

const params = {
      ModelName: model.sagemakerEndpoint,
      TransformInput: {
        ContentType: 'application/json',
        DataSource: {
          S3DataSource: {
            S3DataType: 'S3Prefix',
            // prefix containing the input file(s)
            S3Uri: 's3://'+ inferenceJob.s3Bucket + '/' + inferenceJob.s3InferenceArgsPath,
          }
        },
        SplitType: 'Line'
      },
      TransformJobName: inferenceJob.sagemakerJobName,
      TransformOutput: {
        S3OutputPath: 's3://'+ inferenceJob.s3Bucket + '/' + inferenceJob.s3InferenceOutputPath,
        Accept: 'application/json',
        AssembleWith: 'Line',
      },
      TransformResources: {
        InstanceCount: 2,  // more than one instance requested
        InstanceType: model.defaultInstanceType
      },
      ModelClientConfig: {InvocationsMaxRetries: 0},
      BatchStrategy: 'SingleRecord',
      MaxConcurrentTransforms: 1,  // max parallel requests per instance
      Tags: [
        {
          Key: 'ModelName',
          Value: model.name
        },
      ]
    }
usbhub commented 2 years ago

I ran into this same problem, and I'm really surprised the behavior is to assign a whole file to a host. It can also cause subtler performance issues: if some files are much larger than others, the imbalance isn't immediately obvious because the other hosts are still doing some work. When I split the input up into one file per host, it did work as expected for me. But as was said in the previous thread, the sharding should happen at the record/batch level, not the file level. A rough sketch of the workaround follows.
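In case it helps anyone else, here is the kind of splitting I mean, as a sketch only (Node.js, AWS SDK v2; the bucket, prefix, and file names are made up). The idea is to turn one JSON Lines input into one object per instance under the prefix that TransformInput points at, so each instance can be handed its own object.

const AWS = require('aws-sdk');
const fs = require('fs');
const s3 = new AWS.S3();

// Split a local JSON Lines file into `numShards` objects under an S3 prefix.
// Batch transform hands whole objects to instances, so one object per instance
// gives every instance something to work on.
async function shardInput(localPath, bucket, prefix, numShards) {
  const lines = fs.readFileSync(localPath, 'utf8').split('\n').filter(Boolean);
  const shards = Array.from({ length: numShards }, () => []);
  lines.forEach((line, i) => shards[i % numShards].push(line));
  await Promise.all(shards.map((shard, i) =>
    s3.putObject({
      Bucket: bucket,
      Key: `${prefix}/part-${i}.jsonl`,
      Body: shard.join('\n'),
      ContentType: 'application/json',
    }).promise()
  ));
}

// e.g. shardInput('inference-args.jsonl', 'my-bucket', 'inference/input', 2),
// then point S3Uri at 's3://my-bucket/inference/input'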

jholmes-godaddy commented 2 years ago

@grantdelozier, could you clarify why this was closed? I am running into the same behavior: only one instance is used, even when the number of input files greatly exceeds the number of instances.

dwhite54 commented 1 year ago

@jholmes-godaddy it looks like this is a feature, not a bug.

The solution is to split your input file into multiple pieces, though it seems that @grantdelozier also had trouble with that route.

grantdelozier commented 1 year ago

The short answer to why I closed this issue is that it stopped happening to me. I deleted and re-created my SageMaker model artifact, rebuilt my inference container on ECR, and double- and triple-checked my batch inference invocation parameters, confirming through the SageMaker batch transform UI that I had supplied the parameters and arguments correctly.

After doing this, everything started working as expected when I specified an InstanceCount > 1.

So I guess I had simply misconfigured something. I would encourage others struggling with this issue to go back through the whole process of creating the SageMaker model, the ECR image, and the batch transform job, and verify that everything has been set up correctly.
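One quick way to do that last verification step programmatically is to read the job back with DescribeTransformJob and check that the resources and input match what you intended. This is just a sketch using the same Node.js SDK as above; the job name is hypothetical.

const AWS = require('aws-sdk');
const sagemaker = new AWS.SageMaker();

// Read the submitted job back and confirm the instance count, input prefix,
// and model name are what you think you submitted.
async function checkTransformJob(jobName) {
  const job = await sagemaker.describeTransformJob({ TransformJobName: jobName }).promise();
  console.log('InstanceCount:', job.TransformResources.InstanceCount);
  console.log('InstanceType: ', job.TransformResources.InstanceType);
  console.log('Input S3Uri:  ', job.TransformInput.DataSource.S3DataSource.S3Uri);
  console.log('ModelName:    ', job.ModelName);
  console.log('Status:       ', job.TransformJobStatus);
}

checkTransformJob('my-transform-job');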

Sandy4321 commented 1 year ago

Can you share the workflow for how you did it?