To host the model, this sample currently deploys a GPU-backed real-time SageMaker endpoint, which may be fine for high-volume use cases but is likely too resource-intensive for many others.
[x] Wait until async inference is supported in the SageMaker Python SDK, rather than cluttering the notebook with boto3 endpoint setup (tracking their issue and pull request)
[ ] Update notebook 2 to create an async endpoint with scale-to-zero capability
[ ] Update pipeline to correctly consume async endpoints
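For reference, the async endpoint setup the updated notebook would need differs from the real-time one mainly in the `AsyncInferenceConfig` block of the endpoint config, plus an Application Auto Scaling target whose `MinCapacity` is 0. Below is a rough sketch of the two request payloads (endpoint, model, and bucket names are placeholders, not from this repo); the dicts would be passed to `create_endpoint_config` and `register_scalable_target` respectively:

```python
# Sketch of the request payloads for an async endpoint with scale-to-zero.
# All resource names below are placeholders.

# Would be passed as sagemaker_client.create_endpoint_config(**endpoint_config):
endpoint_config = {
    "EndpointConfigName": "my-async-config",      # placeholder name
    "ProductionVariants": [{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",                  # placeholder name
        "InstanceType": "ml.g4dn.xlarge",         # GPU instance, as in the real-time setup
        "InitialInstanceCount": 1,
    }],
    # This block is what makes the endpoint asynchronous:
    "AsyncInferenceConfig": {
        "OutputConfig": {
            # Results are written here instead of being returned inline:
            "S3OutputPath": "s3://my-bucket/async-results/",  # placeholder bucket
        },
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 4},
    },
}

# Scale-to-zero is configured separately through Application Auto Scaling,
# keyed on the endpoint variant's instance-count dimension. Would be passed
# as autoscaling_client.register_scalable_target(**scalable_target):
scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-async-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 0,   # allows scaling down to zero instances when idle
    "MaxCapacity": 2,
}

print(endpoint_config["AsyncInferenceConfig"]["OutputConfig"]["S3OutputPath"])
print(scalable_target["MinCapacity"])
```

A scaling policy on the backlog-size metric would still be needed on top of the scalable target; the shapes above are just the parts that differ from the current real-time deployment.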
Since the end-to-end workflow here is asynchronous anyway (it may include a human review step), it's probably a good fit for the new SageMaker asynchronous inference feature, which supports scaling down to zero instances when demand is low.

TBD: Do we need to retain a real-time deployment option for anyone who wants to optimize for low latency? It seems unnecessary to me at the moment.
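On the pipeline side, consuming an async endpoint mostly means submitting the S3 input location and then polling the `OutputLocation` the endpoint returns until the result object appears. A minimal polling helper is sketched below, with a stubbed fetch standing in for the real S3 read; `wait_for_async_result` and `fake_fetch` are hypothetical names, not part of this repo or any SDK:

```python
import time

def wait_for_async_result(fetch, timeout=300.0, interval=5.0):
    """Poll for an async inference result until it appears or we time out.

    `fetch` is any zero-arg callable that returns the result payload once the
    endpoint has written it (e.g. an S3 get on the OutputLocation returned by
    invoke_endpoint_async) and None while it is still pending.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch()
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError("async inference result not available in time")

# Stubbed usage: pretend the result appears on the third poll.
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return {"predictions": [0.9]} if calls["n"] >= 3 else None

result = wait_for_async_result(fake_fetch, timeout=10.0, interval=0.01)
print(result)
```

In the real pipeline the fetch would be an S3 `get_object` on the output key (with the "not yet written" case mapped to `None`), or the polling could be replaced entirely by the SNS success/error notifications that async endpoints can emit.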