jyang-aws opened this issue 5 months ago
To allow dynamic batching to occur, the 0th dimension must be identical for all model inputs and outputs. If it is not identical, the current Neuron version does not know how to split the batch apart and concatenate the results back together.
One common root cause is that the outputs are reshaped or that the batch dimension is represented as a Python List. In this case, the most likely cause is that the batched outputs need to be concatenated back into a single tensor before being returned.
For example, the following function is not compatible, since the input has shape [batch, …] but each returned output has shape […]:
def dynamic_batch_incompatible(tensor):
    b0 = tensor[0] + 1  # integer indexing drops the batch dimension
    b1 = tensor[1] + 2
    return b0, b1       # each output has shape [...], not [batch, ...]
This can be fixed by concatenating the outputs so that both the inputs and the outputs have shape [batch, …]. Note that the per-sample results must keep the batch dimension (for example by slicing with a range instead of an integer index) so that the concatenation restores shape [batch, …]:
import torch

def dynamic_batch_compatible(tensor):
    b0 = tensor[0:1] + 1  # range slicing keeps the batch dimension
    b1 = tensor[1:2] + 2
    return torch.cat([b0, b1], dim=0)  # output has shape [batch, ...]
Now Neuron can trivially slice the inputs and concatenate the outputs dynamically. If using a single known fixed batch size, one option is to disable dynamic batching: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-refe[…]ide/inference/api-torch-neuronx-data-parallel.html
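For reference, a minimal sketch of how a traced model is typically wrapped for data-parallel inference. The Plus module and tensor shapes below are hypothetical; torch_neuronx.trace, torch_neuronx.DataParallel, and disable_dynamic_batching() are the APIs described in the linked reference, so check that page for the exact behavior on your Neuron version.

import torch
import torch_neuronx

class Plus(torch.nn.Module):
    def forward(self, tensor):
        # input and output both have shape [batch, ...]
        return tensor + 1

model = Plus().eval()
example = torch.zeros(1, 8)                  # trace with batch size 1
traced = torch_neuronx.trace(model, example)

dp = torch_neuronx.DataParallel(traced)      # dynamic batching is enabled by default
output = dp(torch.zeros(8, 8))               # a larger batch is split across NeuronCores

# Alternative for a single known fixed batch size: trace at that batch size
# and turn dynamic batching off instead.
# dp.disable_dynamic_batching()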
The issue the customer faces with batch inferencing when using the approach proposed in https://github.com/aws-neuron/aws-neuron-sdk/issues/906:
"My output is in Tuple[List[torch.Tensor]] which works well for batch_size =1. But when I try to use DataParallel on the traced model, it says inconsistent size between inputs and outputs. I looked into jit trace and found that even converting directly to torch.tensor would not work as torch.tensor is treated as a constant. I tried creating torch.zeroes(batch_size , seq_length) and then replacing the values in this tensor but that also did not work."
This observation is based on torch-neuronx 2.1.2.
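One possible workaround, sketched below under the assumption that the model returns a fixed-size tuple of lists with one tensor per sample (the wrapper name is hypothetical): stack each list into a single [batch, …] tensor inside forward before returning, so the traced outputs keep the batch dimension and DataParallel can slice and concatenate them.

import torch

class BatchedOutputWrapper(torch.nn.Module):
    """Wraps a model whose output is Tuple[List[torch.Tensor]] so that each
    output becomes one tensor with the batch dimension at dim 0."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, tensor):
        out_a, out_b = self.model(tensor)  # each is a List[torch.Tensor], one entry per sample
        # torch.stack keeps the per-sample values as traced operations, unlike
        # torch.tensor(...), which the tracer folds into a constant.
        return torch.stack(out_a, dim=0), torch.stack(out_b, dim=0)

The wrapper can then be traced at batch size 1 and passed to torch_neuronx.DataParallel as above. Whether this covers the customer's exact model depends on its output structure, so treat it only as a sketch.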