aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding
MIT No Attribution

CDK-deployed thumbnailer and Tesseract OCR not using increased timeouts/limits #32

Closed athewsey closed 1 year ago

athewsey commented 1 year ago

When deploying the pipeline with the alternative OCR (Tesseract) configuration, I'm seeing that on large docs both the SageMaker OCR endpoint and the thumbnailer endpoint are prone to internal errors, either from payload size (e.g. HTTP 413 request too large) or from timing out.

As mentioned in the SageMaker Asynchronous Inference developer guide:

If you're using a SageMaker provided container, you can increase the model server timeout and payload sizes from the default values to the framework‐supported maximums by setting environment variables in this step. You might not be able to leverage the maximum timeout and payload sizes that Asynchronous Inference supports if you don't explicitly set these variables.

I think there are two issues here:

  1. We seem to be relying on the SageMakerCustomizedDLCModel.max_payload_size parameter to set the payload environment variables, but these have been hard-coded to the Multi-Model Server variants (used by HuggingFace and old PyTorch containers) instead of the TorchServe variants (used by current PyTorch containers), e.g. MMS_MAX_REQUEST_SIZE instead of TS_MAX_REQUEST_SIZE.
    • Currently both the thumbnailer and the SageMaker OCR option are configured to use the PyTorch (v1.10) base container, so they should use the TS_ variants instead.
    • We should probably remove this max_payload_size option altogether, unless the SageMakerCustomizedDLCModel is able to detect the base framework/version of the image being used. Switching the statements to TS_... is a viable workaround, but not good practice: the base containers could change back again in future, or e.g. an alternative OCR option could use a HuggingFace container as its base.
  2. The TS_DEFAULT_RESPONSE_TIMEOUT env var doesn't seem to be getting set anywhere on the generated models, which is particularly important for OCR as that can be slow.
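To illustrate the "detect the base framework" idea from point 1, here is a rough sketch only. `guess_env_prefix` is a hypothetical helper, not something that exists in the repo, and it assumes the framework can be guessed from the image repository name. It deliberately ignores the version, so old MMS-based PyTorch images would be misclassified:

```python
from typing import Optional


def guess_env_prefix(image_uri: str) -> Optional[str]:
    """Guess the model server env var prefix from a container image URI.

    Hypothetical helper: assumes the repository name reveals the framework.
    Does NOT check the framework version, so old (MMS-based) PyTorch images
    would wrongly be reported as TorchServe.
    """
    # Keep only the final "repository:tag" part of the URI
    repo_and_tag = image_uri.rsplit("/", 1)[-1].lower()
    # Check HuggingFace first: its repo names also contain "pytorch"
    if "huggingface" in repo_and_tag:
        return "MMS"  # Multi-Model Server variants, e.g. MMS_MAX_REQUEST_SIZE
    if "pytorch" in repo_and_tag:
        return "TS"  # TorchServe variants, e.g. TS_MAX_REQUEST_SIZE
    return None  # Unknown base image: safer not to set anything


# Example URI with a placeholder account ID:
uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.10-cpu-py38"
print(guess_env_prefix(uri))
```

Returning `None` for unrecognised images would let the construct skip the variables entirely rather than set the wrong family.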

From my tests so far, adding these TS_ environment variables to the CDK-built thumbnailer and OCR SageMakerCustomizedDLCModels seems to work as a workaround.
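For reference, a minimal sketch of the kind of environment I mean. `model_server_env` is a hypothetical helper, the size and timeout values are illustrative rather than the framework maximums, and the `*_MAX_RESPONSE_SIZE` variable is my assumption by analogy with the request-size one (only `TS_MAX_REQUEST_SIZE`, `MMS_MAX_REQUEST_SIZE` and `TS_DEFAULT_RESPONSE_TIMEOUT` are named above):

```python
from typing import Dict


def model_server_env(
    prefix: str, max_payload_bytes: int, response_timeout_secs: int
) -> Dict[str, str]:
    """Build model server environment variables for a SageMaker model.

    Hypothetical helper. `prefix` is "TS" for TorchServe-based containers or
    "MMS" for Multi-Model Server-based ones.
    """
    if prefix not in ("TS", "MMS"):
        raise ValueError(f"Unknown model server prefix: {prefix!r}")
    # Container environment values need to be strings
    return {
        f"{prefix}_MAX_REQUEST_SIZE": str(max_payload_bytes),
        # Assumed counterpart of the request-size variable:
        f"{prefix}_MAX_RESPONSE_SIZE": str(max_payload_bytes),
        f"{prefix}_DEFAULT_RESPONSE_TIMEOUT": str(response_timeout_secs),
    }


# Illustrative values only: 100 MiB payloads, 15 minute response timeout
env = model_server_env("TS", 100 * 1024 * 1024, 900)
print(env)
```

The resulting dict would then be merged into whatever environment the thumbnailer and OCR models already pass to their containers.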