aws-neuron / aws-neuron-sdk


Dynamic batching and padding #275

Closed Matthieu-Tinycoaching closed 3 years ago

Matthieu-Tinycoaching commented 3 years ago

Hello,

I have two related problems with TorchScript compilation using the torch_neuron package: both the batch size and the tokenizer's padding length appear to be fixed at compile time. I want to use micro-batching from BentoML, but since I compiled the model with a batch size of 1, every inference with a batch size chosen automatically by BentoML fails with an error like this:

[2021-06-04 10:05:45,400] ERROR - Error caught in API function:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/test_bentoml_inf1_fresh/lib/python3.6/site-packages/bentoml/service/inference_api.py", line 177, in wrapped_func
    return self._user_func(*args, **kwargs)
  File "/home/ubuntu/bentoml/repository/StsBTCustomInf1PytorchService/20210604095716_503730/StsBTCustomInf1PytorchService/sts_transformer_pt_inf1_batchTrue_custom_requirement_file.py", line 41, in predict
    model_output = self.artifacts.model(tensor=encoded_input['input_ids'], tensor0=encoded_input['attention_mask'])
  File "/home/ubuntu/anaconda3/envs/test_bentoml_inf1_fresh/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/convert.py", line 38, in forward
    _18 = torch.embedding(CONSTANTS.c5, _14, 1, False, False)
    _19 = [torch.add(_17, _18, alpha=1), _10, tensor0]
    _20 = ops.neuron.forward_1(_19, CONSTANTS.c6, CONSTANTS.c7, CONSTANTS.c8)
          ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return _20

Traceback of TorchScript, original code (most recent call last):
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/decorators.py(309): neuron_function
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/jit/_trace.py(779): trace
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/decorators.py(313): create_runnable
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/decorators.py(194): trace
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py(448): _convert_item
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py(186): run_op
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py(176): __call__
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py(365): compile_fused_operators
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py(121): trace
torchserve_sts_transformer_torchscript_cpu_pad128_b1.py(76): <module>
RuntimeError: 
    Incorrect tensor shape at input tensor #0: received 3 128 768, expected 1 128 768.
    Incorrect tensor shape at input tensor #1: received 3 1 1 128, expected 1 1 1 128.
    Incorrect tensor shape at input tensor #2: received 3 128, expected 1 128.

The same thing happens with padding: whenever I use padding=True instead of padding='max_length', the call to the model still expects an input tensor of length max_length. This is strange, since the problem does not exist with the original TorchScript model. Would you have any advice regarding this?
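
For reference, the compilation looks roughly like this (a simplified sketch; the model name is a placeholder, and the batch size of 1 and max_length of 128 match the shapes in the traceback):

import torch
import torch_neuron  # registers torch.neuron
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder for the actual sentence-transformer model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torchscript=True)

# Fixed batch size of 1 and fixed sequence length of 128 at trace time
encoded_input = tokenizer("example sentence", padding="max_length", max_length=128,
                          truncation=True, return_tensors="pt")
example_inputs = encoded_input["input_ids"], encoded_input["attention_mask"]

# The traced artifact is specialized to exactly these input shapes
model_neuron = torch.neuron.trace(model, example_inputs=example_inputs)
model_neuron.save("sts_transformer_torchscript_pad128_b1.pt")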

Thanks!

awsilya commented 3 years ago

From looking at the tensors' shapes it appears that BentoML sends a microbatch of 3 (?) requests to a model that was compiled for a batch of one. To support this you need to enable "dynamic batching", e.g.:

model_neuron = torch.neuron.trace(model, example_inputs=example_inputs, dynamic_batch_size=True)

This should cause torch_neuron to chop the incoming microbatch into requests of the compiled batch size (1 in your case).
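
As a sketch, assuming model_neuron was recompiled with dynamic_batch_size=True as above and using the shapes from your traceback:

import torch

# A microbatch of 3 from BentoML, sequence length 128
input_ids = torch.zeros(3, 128, dtype=torch.long)
attention_mask = torch.ones(3, 128, dtype=torch.long)

# The runtime splits this into three batch-1 executions on the NeuronCore
model_output = model_neuron(input_ids, attention_mask)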

Regarding padding - please take a look here: https://github.com/aws/aws-neuron-sdk/blob/master/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.ipynb

It has an example of correct usage of padding.
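
The key point, sketched below (tokenizer, sentences and model_neuron are assumed from your setup): pad every request to the sequence length the model was compiled with, rather than using padding=True, which only pads to the longest sequence in the batch.

encoded_input = tokenizer(
    sentences,
    padding="max_length",   # pad to the compiled sequence length, not just the batch maximum
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
model_output = model_neuron(encoded_input["input_ids"], encoded_input["attention_mask"])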

Matthieu-Tinycoaching commented 3 years ago

Hi @awsilya, thanks for the advice. Do you think it is better to use dynamic batching with a compiled batch size of 1, or with a larger one?

awsilya commented 3 years ago

It's application specific. Think of it this way: a larger compiled batch size usually gives better performance at the hardware level. The most performant case is when you compile for a batch of X and the application submits microbatches that are multiples of X.

However, if you compile for X but consistently submit microbatches that are smaller than X, then we need to pad and move extra data to the device, which negatively impacts performance.

aws-zejdaj commented 3 years ago

For Inferentia we recommend compiling for the smallest batch size that gives the highest throughput. It is typically much smaller than on GPUs, since Inferentia is designed to maximize throughput at small batch sizes; e.g. for BERT Large at sequence length 128 it is batch 6. To find the optimal value for your model we suggest experimenting with a few batch sizes, starting from 1. The second consideration is the speed of model serving/dynamic batching. You can find more details at https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/tensorflow-neuron/tutorials/bert_demo/bert_demo.html#co[…]ion and, for general batching, https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/technotes/neuroncore-batching.html#neuron-batching
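
A rough sketch of such a sweep (the benchmark loop and batch sizes are illustrative, not a specific Neuron utility; model is assumed to be the CPU PyTorch model):

import time
import torch
import torch_neuron  # registers torch.neuron

def throughput(neuron_model, example, n_iters=100):
    # Warm up, then measure average throughput for this batch size
    for _ in range(10):
        neuron_model(*example)
    start = time.time()
    for _ in range(n_iters):
        neuron_model(*example)
    elapsed = time.time() - start
    return example[0].shape[0] * n_iters / elapsed  # inferences per second

for batch_size in (1, 2, 4, 6, 8):
    example = (
        torch.zeros(batch_size, 128, dtype=torch.long),  # input_ids
        torch.ones(batch_size, 128, dtype=torch.long),   # attention_mask
    )
    neuron_model = torch.neuron.trace(model, example_inputs=example)
    print(batch_size, throughput(neuron_model, example))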

aws-zejdaj commented 3 years ago

Please reopen the issue if the documentation does not help with your model.

vprecup commented 2 years ago

Hi there! @aws-zejdaj, how much of the padding and parallelisation mentioned in the example notebooks above is covered by the new(er) DataParallel API?

hannanjgaws commented 2 years ago

Hi @vprecup, the torch.neuron.DataParallel API handles parallelizing large batch sizes across multiple NeuronCores to maximize throughput. torch.neuron.DataParallel also supports dynamic batching to run inference with variable batch sizes. You can learn more about the torch.neuron.DataParallel API here (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/api-torch-neuron-dataparallel-api.html) and read about its capabilities here (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/appnotes/perf/torch-neuron-dataparallel-app-note.html).

torch.neuron.DataParallel does not automatically handle input padding. To accommodate applications with variable sequence lengths / input shapes, we recommend you try "bucketing," as sketched below. You can refer to https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/appnotes/perf/bucketing-app-note.html for a guide on how to implement bucketing for different machine learning applications on Neuron.
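
As a rough illustration of the bucketing idea (the bucket lengths, model and tokenizer here are assumptions; see the app note above for the full pattern), you compile one model per sequence-length bucket and route each request to the smallest bucket it fits into:

import torch
import torch_neuron  # registers torch.neuron

# Assume model is the CPU PyTorch model and tokenizer is the matching tokenizer
buckets = {}
for seq_len in (32, 64, 128):
    example = (
        torch.zeros(1, seq_len, dtype=torch.long),
        torch.ones(1, seq_len, dtype=torch.long),
    )
    buckets[seq_len] = torch.neuron.trace(
        model, example_inputs=example, dynamic_batch_size=True
    )

def infer(text):
    # Pick the smallest bucket that fits the tokenized input, then pad to it
    length = len(tokenizer(text)["input_ids"])
    seq_len = next((b for b in sorted(buckets) if b >= length), max(buckets))
    encoded = tokenizer(text, padding="max_length", max_length=seq_len,
                        truncation=True, return_tensors="pt")
    return buckets[seq_len](encoded["input_ids"], encoded["attention_mask"])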

vprecup commented 2 years ago

Thanks for the clarification, @hannanjgaws! One more thing, as I wasn't sufficiently explicit with the data parallelisation part of my question: is there a functional overlap between torch.neuron.DataParallel and compiling the model with the dynamic_batch_size=True flag? If not, what is the difference between the two?

hannanjgaws commented 2 years ago

There is overlap between torch.neuron.DataParallel and compiling a model with the dynamic_batch_size=True flag.

The dynamic_batch_size=True flag inserts logic at compilation time that tells the Neuron runtime to automatically handle variable sized batches of data during inference. Dynamic sizing is restricted to the 0th dimension of a tensor.

torch_neuron.DataParallel splits the batched input into smaller batches. By default, torch_neuron.DataParallel attempts to use dynamic batching. In this case, dynamic batching is enabled at runtime, which makes it possible for the smaller batches created by torch_neuron.DataParallel to differ from the original compile-time batch size (functionally, this is just like using the dynamic_batch_size=True flag). This lets you pass dynamic batch sizes during inference when you use the torch_neuron.DataParallel module. You can learn more about how torch_neuron.DataParallel dynamic batching works in this document: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/appnotes/perf/torch-neuron-dataparallel-app-note.html#dynamic-batching.
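
A minimal usage sketch (the saved model path, input shapes, and batch size of 11 are illustrative):

import torch
import torch_neuron  # registers the Neuron ops needed to load the compiled model

# Load a model previously compiled with torch.neuron.trace
model_neuron = torch.jit.load("model_neuron.pt")

# Split batches across all visible NeuronCores; dynamic batching is attempted by default
model_parallel = torch.neuron.DataParallel(model_neuron)

# A batch size that differs from the compile-time batch size still runs
input_ids = torch.zeros(11, 128, dtype=torch.long)
attention_mask = torch.ones(11, 128, dtype=torch.long)
output = model_parallel(input_ids, attention_mask)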

Please let us know if you have any additional questions.