aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

RuntimeError: 507 - System out of memory on SageMaker #426

Closed DevBey closed 2 years ago

DevBey commented 2 years ago

Hi, I hit the above error when trying to deploy an XLM-RoBERTa model (via flair) on an inf1.xlarge instance; the deployment kept failing and the worker kept dying.

When I changed the instance to inf1.2xlarge, the error went away and the model deployed successfully.

2022-05-04T18:39:46,451 [INFO ] pool-2-thread-7 ACCESS_LOG - /10.32.0.2:33848 "GET /ping HTTP/1.1" 200 1
2022-05-04T18:39:46,452 [INFO ] pool-2-thread-7 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:model.aws.local,timestamp:null
2022-05-04T18:39:47,322 [INFO ] W-9000-model_1-stdout MODEL_LOG - 2022-05-04 18:39:47,322 loading file /home/model-server/tmp/models/a32d397d321f4974bec946c6380b18e9/neuron_compiled_model_NER_Flair_SequenceTagger.pt
2022-05-04T18:39:51,451 [INFO ] pool-2-thread-7 ACCESS_LOG - /10.32.0.2:33848 "GET /ping HTTP/1.1" 200 1
2022-05-04T18:39:51,451 [INFO ] pool-2-thread-7 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:model.aws.local,timestamp:null
2022-05-04T18:39:56,451 [INFO ] pool-2-thread-7 ACCESS_LOG - /10.32.0.2:33848 "GET /ping HTTP/1.1" 200 1
2022-05-04T18:39:56,453 [INFO ] pool-2-thread-7 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:model.aws.local,timestamp:null
2022-05-04T18:40:01,487 [INFO ] pool-2-thread-7 ACCESS_LOG - /10.32.0.2:33848 "GET /ping HTTP/1.1" 200 3
2022-05-04T18:40:01,488 [INFO ] pool-2-thread-7 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:model.aws.local,timestamp:null
2022-05-04T18:40:04,175 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Backend response time: 20112
2022-05-04T18:40:04,175 [INFO ] W-9000-model_1-stdout MODEL_LOG - Backend worker process died.
2022-05-04T18:40:04,176 [INFO ] W-9000-model_1-stdout MODEL_LOG - Traceback (most recent call last):
2022-05-04T18:40:04,176 [INFO ] W-9000-model_1-stdout MODEL_LOG - File "/opt/conda/lib/python3.7/site-packages/ts/model_service_worker.py", line 189, in <module>
2022-05-04T18:40:04,176 [INFO ] W-9000-model_1-stdout MODEL_LOG - worker.run_server()
2022-05-04T18:40:04,176 [INFO ] W-9000-model_1-stdout MODEL_LOG - File "/opt/conda/lib/python3.7/site-packages/ts/model_service_worker.py", line 161, in run_server
2022-05-04T18:40:04,177 [INFO ] W-9000-model_1-stdout MODEL_LOG - self.handle_connection(cl_socket)
2022-05-04T18:40:04,177 [INFO ] W-9000-model_1-stdout MODEL_LOG - File "/opt/conda/lib/python3.7/site-packages/ts/model_service_worker.py", line 128, in handle_connection
2022-05-04T18:40:04,177 [INFO ] W-9000-model_1-stdout MODEL_LOG - raise RuntimeError("{} - {}".format(code, result))
2022-05-04T18:40:04,177 [INFO ] W-9000-model_1-stdout MODEL_LOG - RuntimeError: 507 - System out of memory
2022-05-04T18:40:04,177 [INFO ] W-9000-model_1 TS_METRICS - WorkerThreadTime.ms:3|#Level:Host|#hostname:model.aws.local,timestamp:null
2022-05-04T18:40:04,352 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_ERROR
2022-05-04T18:40:04,353 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stderr
2022-05-04T18:40:04,353 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stdout

Metrics:

2022-05-04T18:07:43,722 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2022-05-04T18:07:43,789 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:model.aws.local,timestamp:1651687663
2022-05-04T18:07:43,792 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:11.85373306274414|#Level:Host|#hostname:model.aws.local,timestamp:1651687663
2022-05-04T18:07:43,792 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:12.500110626220703|#Level:Host|#hostname:model.aws.local,timestamp:1651687663
2022-05-04T18:07:43,793 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:51.3|#Level:Host|#hostname:model.aws.local,timestamp:1651687663
2022-05-04T18:07:43,794 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:5335.06640625|#Level:Host|#hostname:model.aws.local,timestamp:1651687663
2022-05-04T18:07:43,795 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2089.80859375|#Level:Host|#hostname:model.aws.local,timestamp:1651687663
2022-05-04T18:07:43,796 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:30.6|#Level:Host|#hostname:model.aws.local,timestamp:1651687663
2022-05-04T18:07:44,772 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000

I was constantly monitoring the metrics, but CPU and memory usage never went above 50 percent, which doesn't explain the error above.

jluntamazon commented 2 years ago

You may be running out of host memory due to the size of the model combined with the use of a multi-process serving framework.

To get a baseline for expected memory consumption, we tested the example of RobertaForCausalLM.

Out of the box, a single model with this configuration consumes approximately 700 MB of memory per process with no other serving-framework overhead (measured using psutil for the current process). With 4 workers, the models alone will therefore consume ~2.8 GB of host memory.
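
For reference, a minimal sketch of this kind of per-process measurement (the model file name is hypothetical; assumes psutil, torch, and torch-neuron are installed):

```python
# Minimal sketch: measure the host memory added by loading one Neuron-compiled
# model in the current process. The model path below is hypothetical.
import os

import psutil
import torch
import torch_neuron  # noqa: F401  (registers the Neuron ops used by the compiled model)


def rss_mb() -> float:
    """Resident set size of the current process, in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / (1024 ** 2)


before = rss_mb()
model = torch.jit.load("neuron_compiled_model.pt")  # hypothetical artifact
after = rss_mb()
print(f"Loading the model added ~{after - before:.0f} MB of host memory")
```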

This memory consumption comes from loading the embedding table on the CPU in addition to holding the underlying Neuron model binary artifact in memory. Currently there is no way to reduce this memory when using multiple processes.

The out-of-memory error is likely due to this Neuron memory usage on top of the memory used by TorchServe itself. We may need more specific model/environment details to determine exactly where the memory is being consumed.

DevBey commented 2 years ago

Hi @jluntamazon,

Can we somehow reduce the number of workers to, say, two? Would that help?

This is the current memory usage on the inf1.2xlarge: [image]

jluntamazon commented 2 years ago

In addition to the previous sources of memory consumption, there may also be overhead from SageMaker itself; we have previously measured ~700 MB of such overhead.

Decreasing the number of workers should decrease memory usage, but the ideal configuration for maximizing performance is one worker per NeuronCore; with too few workers, some NeuronCores may sit idle. One way to cap the worker count is sketched below.
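
A hedged sketch of capping the worker count when deploying through the SageMaker Python SDK (not a verified configuration for this deployment): the SageMaker PyTorch/TorchServe serving containers generally read the SAGEMAKER_MODEL_SERVER_WORKERS environment variable, and the image URI, S3 path, and role ARN below are placeholders, not values from this issue.

```python
# Hedged sketch: cap the model server at 2 workers for a SageMaker endpoint.
# Assumes the serving container honors SAGEMAKER_MODEL_SERVER_WORKERS;
# the image URI, S3 path, and role ARN are placeholders.
from sagemaker.model import Model

model = Model(
    image_uri="<neuron-inference-container-image-uri>",
    model_data="s3://<bucket>/neuron_compiled_model.tar.gz",
    role="<sagemaker-execution-role-arn>",
    env={"SAGEMAKER_MODEL_SERVER_WORKERS": "2"},  # e.g. 2 workers instead of 4
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf1.xlarge",
)
```

The trade-off is the one noted above: with fewer workers than NeuronCores, some cores sit idle.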

Since this is primarily a host-memory limitation, an inf1.2xlarge may currently be a better fit for your deployment: you will be able to use all NeuronCores and still have enough host memory to hold the models.

If an inf1.2xlarge does not fit your price/performance budget, additional torch-neuron optimizations to reduce memory usage are planned. Depending on the model, these may reduce memory usage significantly, though the improvement will be specific to your configuration. We expect this to be available in an upcoming release.

DevBey commented 2 years ago

Hi @jluntamazon,

Yeah, sure, that makes sense for the out-of-memory issue.

As for the solution: is there a way to reduce the number of NeuronCores? I don't think it would affect us much, since ours is mostly a serial processing pipeline.

What's the timeline for the aforementioned memory optimization?
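
For reference, the number of NeuronCores a process acquires can typically be limited with the NEURON_RT_NUM_CORES environment variable, provided it is set before any Neuron model is loaded; a minimal sketch under that assumption (the model path is hypothetical):

```python
# Minimal sketch: restrict this process to 2 NeuronCores.
# Assumes the Neuron Runtime reads NEURON_RT_NUM_CORES at initialization,
# so it is set here before the first Neuron model is loaded.
import os

os.environ["NEURON_RT_NUM_CORES"] = "2"

import torch
import torch_neuron  # noqa: F401  (imported after setting the variable on purpose)

model = torch.jit.load("neuron_compiled_model.pt")  # hypothetical artifact
```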

aws-maens commented 2 years ago

Adding roadmap item (https://github.com/aws/aws-neuron-sdk/issues/428) and closing the issue.