aws / sagemaker-pytorch-inference-toolkit

Toolkit for allowing inference and serving with PyTorch on SageMaker. Dockerfiles used for building SageMaker Pytorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0
131 stars 70 forks

Incorrect reporting of memory utilisation #141

Open david-waterworth opened 1 year ago

david-waterworth commented 1 year ago

Describe the bug I'm running into issues with batch transform due to what I assume is an OOM condition. The main problem appears to be that, as far as I can see, there's no way to explicitly configure the batch_size for a batch transform.

Instead, the batch_size appears to be controlled by MaxPayloadInMB, which has a minimum of 1. I added logging in my predict_fn and observe that I'm receiving a mix of batches: some contain 1000 examples and some contain 10k+ examples. The huge batches are pretty much 1MB in size. I have no idea where the batches of 1000 come from (I'm wondering if it's splitting the last batch that is less than the 1MB payload).
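For context, the transform job settings that influence how much data reaches the container per request can be sketched as the request dictionary below, in the shape accepted by boto3's `create_transform_job`. The job name, model name, S3 URIs, content type, and instance type are placeholders rather than values from this report, and line-delimited input is assumed. None of these fields sets a record count per batch directly.

```python
# Sketch of a CreateTransformJob request. All names and S3 URIs are placeholders.
transform_job_request = {
    "TransformJobName": "my-transform-job",   # hypothetical
    "ModelName": "my-pytorch-model",          # hypothetical
    "MaxPayloadInMB": 1,                      # caps the request size in MB, not the record count
    "BatchStrategy": "MultiRecord",           # "SingleRecord" sends one record per request
    "MaxConcurrentTransforms": 1,
    "TransformInput": {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/input/",   # hypothetical
            }
        },
        "ContentType": "application/jsonlines",     # assumed input format
        "SplitType": "Line",                        # split input objects on newlines
    },
    "TransformOutput": {
        "S3OutputPath": "s3://my-bucket/output/",   # hypothetical
        "AssembleWith": "Line",
    },
    "TransformResources": {"InstanceType": "ml.m5.large", "InstanceCount": 1},
}

# To submit the job (requires AWS credentials and the boto3 package):
# import boto3
# boto3.client("sagemaker").create_transform_job(**transform_job_request)
```

With `MultiRecord`, the number of records per request follows from record size and the payload cap, which would be consistent with the variable batch sizes described above.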

The issue is that the large batches seem to occasionally cause the worker to crash. I suspect it's an out-of-memory condition (the obvious workaround is to pick a machine with more memory). When I look at the logs, the maximum utilisation appears to be around 50%, but on closer inspection that metric appears wrong: the example below has MemoryUsed=3537.828125 and MemoryAvailable=3843.3515625, a ratio of about 92%, yet MemoryUtilization is reported as 50%.
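One possible mitigation that does not require a larger instance is to cap the batch size inside `predict_fn` itself, so that a 10k+ example payload is processed in smaller pieces. The following is a minimal sketch, not the toolkit's own behaviour: `MAX_BATCH_SIZE` and `iter_chunks` are hypothetical names, and `model` is assumed to be any callable that maps a sequence of inputs to a list of outputs.

```python
from typing import Any, Callable, Iterator, List, Sequence

MAX_BATCH_SIZE = 256  # hypothetical cap; tune to the instance's memory


def iter_chunks(items: Sequence[Any], size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_fn(input_data: Sequence[Any],
               model: Callable[[Sequence[Any]], List[Any]]) -> List[Any]:
    """Run the model on bounded mini-batches and concatenate the results in order."""
    outputs: List[Any] = []
    for chunk in iter_chunks(input_data, MAX_BATCH_SIZE):
        outputs.extend(model(chunk))
    return outputs
```

Peak memory then depends on `MAX_BATCH_SIZE` rather than on however many records fit into the payload, at the cost of more forward passes per request.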

Expected behavior MemoryUtilization = 100.0 * MemoryUsed / MemoryAvailable
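The arithmetic can be checked with a short sketch using the values from the logs in this report. Two readings are compared. The first is the expected formula above. The second is an assumption, not something the logs confirm: that MemoryAvailable means memory still free, so that total memory is roughly used plus available. The function names are illustrative only.

```python
def pct_used_over_available(used_mb: float, available_mb: float) -> float:
    """Expected formula from this report: 100 * MemoryUsed / MemoryAvailable."""
    return 100.0 * used_mb / available_mb


def pct_used_over_total(used_mb: float, available_mb: float) -> float:
    """Alternative reading (assumption): total memory ~= used + available."""
    return 100.0 * used_mb / (used_mb + available_mb)


used_mb = 3537.828125          # MemoryUsed.Megabytes from the logs
available_mb = 3843.3515625    # MemoryAvailable.Megabytes from the logs

print(f"used / available:          {pct_used_over_available(used_mb, available_mb):.1f}%")
print(f"used / (used + available): {pct_used_over_total(used_mb, available_mb):.1f}%")
```

The first reading gives roughly 92% and the second roughly 48%, so the reported 50.0 would be close to consistent only if MemoryAvailable is free memory rather than total memory.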

Screenshots or logs

| Timestamp | Message |
| -- | -- |
| 2023-03-22T12:53:27.708+11:00 | 2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:3843.3515625\|#Level:Host\|#hostname:4a73e96743e7,timestamp:1679450006 |
| 2023-03-22T12:53:27.708+11:00 | 2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:3537.828125\|#Level:Host\|#hostname:4a73e96743e7,timestamp:1679450006 |
| 2023-03-22T12:53:27.708+11:00 | 2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:50.0\|#Level:Host\|#hostname:4a73e96743e7,timestamp:1679450006 |
