From looking at the tensors' shapes it appears that BentoML sends a microbatch of 3 (?) requests to a model that was compiled for a batch size of one. To support this you need to enable "dynamic batching", e.g.
model_neuron = torch.neuron.trace(model, example_inputs=example_inputs, dynamic_batch_size=True)
This should cause torch_neuron to chop the incoming microbatch up into requests of the compiled batch size, which is 1 in your case.
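For context, here is a minimal end-to-end sketch of compiling with dynamic batching. The model name, sequence length of 128, and the output file name are illustrative assumptions, not taken from this thread:

```python
import torch
import torch_neuron  # registers torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative model/tokenizer -- substitute your own.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True).eval()

# Trace with a batch size of 1; dynamic_batch_size=True lets the runtime
# accept microbatches whose 0th dimension differs from the compiled size.
example = tokenizer("example input", return_tensors="pt",
                    padding="max_length", max_length=128)
example_inputs = (example["input_ids"], example["attention_mask"])
model_neuron = torch.neuron.trace(model, example_inputs=example_inputs,
                                  dynamic_batch_size=True)
model_neuron.save("model_neuron.pt")
```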
Regarding padding - please take a look here: https://github.com/aws/aws-neuron-sdk/blob/master/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.ipynb
It has an example of correct usage of padding.
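In the spirit of that notebook, the key point is to pad every request to the fixed sequence length the model was traced with, so input shapes always match the compiled graph. A rough sketch, assuming a Hugging Face tokenizer and the (illustrative) saved model path from the snippet above:

```python
import torch
import torch_neuron  # needed so the Neuron ops are registered before loading
from transformers import AutoTokenizer

# Illustrative values -- use the tokenizer and sequence length your model was compiled with.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
max_length = 128

inputs = tokenizer("The company HuggingFace is based in New York City",
                   "HuggingFace's headquarters are situated in Manhattan",
                   max_length=max_length,
                   padding="max_length",   # pad to the compiled length, not just the longest in the batch
                   truncation=True,
                   return_tensors="pt")

model_neuron = torch.jit.load("model_neuron.pt")  # hypothetical path to the compiled model
logits = model_neuron(inputs["input_ids"], inputs["attention_mask"])
```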
Hi @awsilya, thanks for the advice. Do you think it is better to use dynamic batching with a compiled batch size of 1, or with a larger one?
It's application specific. Think of it this way: a larger compiled batch usually gives better performance at the hardware level. The most performant case is when you compile for a batch of X and the app submits microbatches that are multiples of X.
However, if you compile for X but consistently submit microbatches that are smaller than X, then we need to pad and move extra data to the device, which negatively impacts performance.
For Inferentia we recommend compiling the smallest batch size that results in the highest throughput. It is typically much smaller than on GPUs, since Inferentia is designed to maximize throughput at small batch sizes; e.g. for BERT Large with sequence length 128 it is batch size 6. To find the optimal value for your model we suggest experimenting with a few batch sizes, starting from 1 (a rough sweep is sketched below). The second consideration is the speed of model serving/dynamic batching. You can find more details at https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/tensorflow-neuron/tutorials/bert_demo/bert_demo.html#co[…]ion and, for general batching, https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/technotes/neuroncore-batching.html#neuron-batching
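A crude way to run that sweep is to re-trace the model at a few batch sizes and time a fixed number of inferences at each. The model, sequence length, and iteration counts below are illustrative assumptions:

```python
import time
import torch
import torch_neuron
from transformers import AutoModelForSequenceClassification

# Rough sweep to find the smallest compiled batch size with the highest throughput.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True).eval()
seq_len, n_iters = 128, 100

for batch_size in (1, 2, 4, 6, 8):
    example_inputs = (torch.zeros(batch_size, seq_len, dtype=torch.long),
                      torch.ones(batch_size, seq_len, dtype=torch.long))
    compiled = torch.neuron.trace(model, example_inputs=example_inputs)
    for _ in range(10):            # warm up before timing
        compiled(*example_inputs)
    start = time.time()
    for _ in range(n_iters):
        compiled(*example_inputs)
    elapsed = time.time() - start
    print(f"batch {batch_size}: {batch_size * n_iters / elapsed:.1f} sequences/sec")
```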
Please reopen the issue if the documentation does not help with your model.
Hi there! @aws-zejdaj, how much of the padding and parallelisation mentioned in the example notebooks above is covered by the new(er) DataParallel API?
Hi @vprecup, the torch.neuron.DataParallel API handles parallelizing large batch sizes across multiple NeuronCores to maximize throughput. torch.neuron.DataParallel also supports dynamic batching to run inference with variable batch sizes. You can learn more about the torch.neuron.DataParallel API here (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/api-torch-neuron-dataparallel-api.html) and read about its capabilities here (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/appnotes/perf/torch-neuron-dataparallel-app-note.html).
torch.neuron.DataParallel does not automatically handle input padding. To accommodate applications with variable sequence lengths / input shapes, we recommend you try "bucketing." You can refer to https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/appnotes/perf/bucketing-app-note.html for a guide on how to implement bucketing for different machine learning applications on Neuron.
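To illustrate the bucketing idea, here is a minimal sketch: compile one Neuron model per bucket length offline, then pad each request up to the smallest bucket that fits it. The bucket lengths, file names, and tokenizer below are assumptions for illustration, not part of the official API:

```python
import torch
import torch_neuron  # registers Neuron ops so the compiled models can be loaded
from transformers import AutoTokenizer

# Hypothetical pre-compiled models, one per sequence-length bucket.
buckets = {64: "model_neuron_64.pt", 128: "model_neuron_128.pt", 256: "model_neuron_256.pt"}
models = {length: torch.jit.load(path) for length, path in buckets.items()}
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def infer(text):
    token_count = len(tokenizer.encode(text))
    # Pick the smallest bucket that holds the tokenized input; fall back to the largest.
    length = next((b for b in sorted(buckets) if b >= token_count), max(buckets))
    inputs = tokenizer(text, max_length=length, padding="max_length",
                       truncation=True, return_tensors="pt")
    return models[length](inputs["input_ids"], inputs["attention_mask"])
```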
Thanks for the clarification, @hannanjgaws! One more thing, as I wasn't sufficiently explicit with the data parallelisation part of my question: is there a functional overlap between torch.neuron.DataParallel and compiling the model with the dynamic_batch_size=True flag? If not, what is the difference between the two?
There is overlap between torch.neuron.DataParallel and compiling a model with the dynamic_batch_size=True flag.
The dynamic_batch_size=True flag inserts logic at compilation time that tells the Neuron runtime to automatically handle variable-sized batches of data during inference. Dynamic sizing is restricted to the 0th dimension of a tensor.
torch_neuron.DataParallel splits the batched input into smaller batches. By default, torch_neuron.DataParallel attempts to use dynamic batching. In this case, dynamic batching is enabled at runtime and makes it possible for the smaller batches created by torch_neuron.DataParallel to not match the original compilation-time batch size (functionally this is just like using the dynamic_batch_size=True flag). This enables you to pass dynamic batch sizes during inference when you're using the torch_neuron.DataParallel module. You can learn more about how torch_neuron.DataParallel dynamic batching works in this document: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/appnotes/perf/torch-neuron-dataparallel-app-note.html#dynamic-batching.
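As a rough usage sketch (the model path, sequence length, and batch sizes are illustrative assumptions), wrapping the compiled model in torch.neuron.DataParallel lets you submit batch sizes that differ from the compiled one:

```python
import torch
import torch_neuron

# Load a model compiled for batch size 1 and shard incoming batches across NeuronCores.
model_neuron = torch.jit.load("model_neuron.pt")         # hypothetical path
model_parallel = torch.neuron.DataParallel(model_neuron)

# With dynamic batching (the default), inference batch sizes need not match
# the compiled batch size.
for batch_size in (3, 5, 8):
    input_ids = torch.zeros(batch_size, 128, dtype=torch.long)   # placeholder tensors
    attention_mask = torch.ones(batch_size, 128, dtype=torch.long)
    outputs = model_parallel(input_ids, attention_mask)
```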
Please let us know if you have any additional questions.
Hello,
I have two related problems with TorchScript compilation from the torch_neuron package. It seems that everything is static, such as the batch_size and the padding_length for the tokenizer. Since I compiled the model with a batch_size of 1, whenever I want to use microbatching from BentoML, each inference with a batch size automatically determined by BentoML gives an error like that:
The same is observable with padding: whenever I want to use padding=True instead of padding='max_length', the call to the model expects an input tensor of size max_length. This is strange since with the original TorchScript the problem doesn't exist. Would you have any advice regarding this? Thanks!