Toolkit for allowing inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker Pytorch Containers are at https://github.com/aws/deep-learning-containers.
Describe the bug
I'm running into issues with batch transform due to what I assume is an OOM condition. The main problem appears to be because as far as I can see there's no way to explicitly configure the batch_size for a batch transform that I'm aware of.
Instead the batch_size appears to be controlled by MaxPayloadInMB which has a minimum of 1. I added logging in my predict_fn and observe that I'm receiving a mix of batches containing 1000 examples, and some that contain 10k+ examples. The huge batches are pretty much 1MB is size - I have no idea where the batches of 1000 come from (I'm wondering if its splitting the last batch that is less than the 1MB payload).
The issue is that the large batches seem to occasionally cause the worker to crash - I suspect it's an out-of-memory (the obvious workaround is to pick a machine with more memory). When I look at the logs the maximum utilisation appears to be around 50% - but looking closer that metric appears wrong, the example below has MemoryUsed=3537.828125 / MemoryAvailable=3843.3515625 = MemoryUtilization=50%
Describe the bug I'm running into issues with batch transform due to what I assume is an OOM condition. The main problem appears to be because as far as I can see there's no way to explicitly configure the batch_size for a batch transform that I'm aware of.
Instead the batch_size appears to be controlled by
MaxPayloadInMB
which has a minimum of 1. I added logging in mypredict_fn
and observe that I'm receiving a mix of batches containing 1000 examples, and some that contain 10k+ examples. The huge batches are pretty much 1MB is size - I have no idea where the batches of 1000 come from (I'm wondering if its splitting the last batch that is less than the 1MB payload).The issue is that the large batches seem to occasionally cause the worker to crash - I suspect it's an out-of-memory (the obvious workaround is to pick a machine with more memory). When I look at the logs the maximum utilisation appears to be around 50% - but looking closer that metric appears wrong, the example below has MemoryUsed=3537.828125 / MemoryAvailable=3843.3515625 = MemoryUtilization=50%
Expected behavior MemoryUtilization = 100.0 * MemoryUsed / MemoryAvailable
Screenshots or logs
System information A description of your system. Please provide:
Additional context Add any other context about the problem here.