DevBey closed this issue 2 years ago
You may be running out of memory due to the size of the model in combination with using a multi-process oriented serving framework.
To get a baseline for expected memory consumption, we tested the example of RobertaForCausalLM.
Out of the box, a single model with this configuration will consume approximately 700 MB of memory per process, with no other serving-framework overhead (measured using psutil for the current process). With 4 workers, this means the models alone will consume ~2.8 GB of host memory.
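For anyone who wants to reproduce this kind of baseline, here is a minimal sketch of the measurement. It assumes psutil is installed (falling back to the stdlib resource module otherwise); the model-loading line is illustrative and commented out, since the artifact path is specific to your deployment:

```python
import os

def rss_mb():
    """Resident set size of the current process, in MB."""
    try:
        import psutil  # as used for the measurements above
        return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    except ImportError:
        import resource  # fallback: ru_maxrss is peak RSS, in KB on Linux
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = rss_mb()
# model = torch.jit.load("model_neuron.pt")  # hypothetical Neuron artifact
after = rss_mb()
print(f"model footprint: {after - before:.0f} MB")
```

Multiply the per-process footprint by the worker count to estimate the total, since each worker process loads its own copy of the model.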
Memory consumption is caused by loading the embedding table to the CPU in addition to holding the underlying Neuron model binary artifact in memory. Currently there is no way to reduce this memory usage when using multiple processes.
The out-of-memory error is likely due to this Neuron memory usage in addition to the memory used by TorchServe. We would need more specific model/environment configurations to determine exactly where memory is being consumed.
Hi @jluntamazon,
Can we somehow reduce the number of workers to, say, two? Would that help?
This is the current memory usage on inf1.2xlarge:
In addition to the previous sources of memory consumption, there may also be additional overhead from SageMaker itself. Previously, we had measured ~700 MB of overhead.
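Putting the figures in this thread together, a quick back-of-envelope budget (all numbers are the approximate values quoted above, not measurements of your specific deployment):

```python
model_mb = 700       # per-process model footprint (RobertaForCausalLM baseline)
workers = 4          # one worker per NeuronCore
sagemaker_mb = 700   # previously measured SageMaker overhead
total_gb = (model_mb * workers + sagemaker_mb) / 1024
print(f"~{total_gb:.1f} GB before TorchServe/OS overhead")  # ~3.4 GB
```

That total does not yet include the TorchServe frontend, the Python interpreters themselves, or the OS, which is why a host with modest memory can run out even though the model math alone looks like it fits.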
Decreasing the number of workers should decrease memory usage, but the ideal configuration to maximize performance is to have one worker per NeuronCore. If you do not have enough workers, you may be leaving NeuronCores idle.
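If you do want to cap the worker count (assuming a standard TorchServe setup; the values below are illustrative), it can be set globally in config.properties or per model via the management API:

```properties
# config.properties (TorchServe) -- illustrative value
default_workers_per_model=2
```

Per model at runtime, the management API accepts `min_worker`/`max_worker`, e.g. `PUT /models/<model_name>?min_worker=2&max_worker=2` against the management port. Keep in mind the trade-off noted above: with fewer workers than NeuronCores, some NeuronCores will sit idle.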
Since this is primarily a host-memory limitation, an inf1.2xlarge may currently be a better fit for your deployment, since you will be able to use all NeuronCores and still have sufficient host memory to fit the models.
If an inf1.2xlarge does not fit your price/performance budget, additional torch-neuron optimizations to reduce memory usage are planned. Depending on the model, these may significantly reduce memory usage, but the improvement will be specific to your configuration. We expect this to be available in an upcoming release.
Hi @jluntamazon ,
Yeah, sure, that makes sense for the out-of-memory issue.
As for the solution, is there a way of reducing the number of NeuronCores? I don't think it would affect us much, as ours is more of a serial processing pipeline.
What's the timeline for the aforementioned memory optimization?
Adding roadmap item (https://github.com/aws/aws-neuron-sdk/issues/428) and closing the issue.
Hi, I faced the above error when trying to deploy xlm-roberta from flair using an inf1.xlarge instance; the deployment kept failing and the worker kept dying.
When I changed the instance to inf1.2xlarge, the error went away and the model got deployed!
(metrics screenshot)
I was constantly monitoring the metrics, but CPU and memory usage never went above 50 percent, which doesn't explain the above error.