Hi @Limess:
Based on the error you’re seeing, it looks like the compile-time sequence length (512) is not the same as the sequence length that you’re using at inference time (33) in the predict_fn. The compile-time sequence length must match the inference-time sequence length on Neuron. (The exception to this is if you use torch_neuron.DataParallel with dim=1 - but I’m not sure you want to do this for your application, because it means you’ll be breaking up a sequence of words.)
Can you make sure that the sequence length of your inference-time inputs in predict_fn matches the sequence length that you’re using for compilation? Additionally, you’ll also want to make sure that the padding and truncation settings are the same at compilation time and inference time. Based on your compilation tokenizer and runtime tokenizer, it looks like this is not the case:
Compilation Time:
embeddings = tokenizer(
    dummy_input,
    max_length=MAX_LENGTH,
    padding="max_length",
    return_tensors="pt",
    truncation=True,
)
Inference Time:
embeddings = tokenizer(
    inputs,
    return_tensors="pt",
    max_length=model_config.traced_sequence_length,
    padding="longest",
    truncation="longest_first",
)
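One way to make the runtime call consistent with the compile-time one is to pad every request up to the traced length. A sketch only - it assumes model_config.traced_sequence_length equals the MAX_LENGTH used at compilation:

embeddings = tokenizer(
    inputs,
    max_length=model_config.traced_sequence_length,  # same value as the compile-time MAX_LENGTH
    padding="max_length",  # pad every request up to the traced length
    truncation=True,       # truncate anything longer than the traced length
    return_tensors="pt",
)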
Thanks, I've updated the code to ensure it pads up to the max length, and I now receive a different error:
Prediction error
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 234, in handle
response = self.transform_fn(self.model, input_data, content_type, accept)
File "/opt/conda/lib/python3.7/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 190, in transform_fn
predictions = self.predict(processed_data, model)
File "/.sagemaker/mms/models/model/inference.py", line 62, in predict_fn
logits = model(*model_inputs)[0]
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch_neuron/data_parallel.py", line 223, in forward
return self.loaded_modules[0](*inputs[0])
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "code/__torch__/torch_neuron/runtime/___torch_mangle_229.py", line 24, in forward
_8 = torch.embedding(CONSTANTS.c5, _5, 1)
model = _NeuronGraph_60.model
_9 = ops.neuron.forward_v2_1([argument_2, _6, _7, _8], model)
~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return (_9,)
RuntimeError: Inconsistent batch sizes found on inputs. All batch tensors must have the same dim 0 size.
Input tensor #0 shape: 16 512
Input tensor #1 shape: 16 512 768
Input tensor #2 shape: 4 512 768
Input tensor #3 shape: 16 512 768
I've logged the keys/values returned from the tokenizer, and I don't understand how these correspond to the input tensors reported in the error:
Tokenized output keys
dict_keys(['input_ids', 'attention_mask'])
Tokenized output shape:
[[16, 512], [16, 512]]
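Roughly how I'm logging this (a simplified sketch rather than the exact inference code):

embeddings = tokenizer(
    inputs,
    max_length=512,  # the traced sequence length
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print("Tokenized output keys", embeddings.keys())
print("Tokenized output shape:", [list(t.shape) for t in embeddings.values()])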
@Limess, thank you again for filing the issue.
We were able to reproduce a similar error to what you’re seeing using the open source cardiffnlp/roberta-base-sentiment model. The error you’re seeing is likely caused by an issue in our partitioner. Models that contain unsupported operators, such as aten::embedding, are partitioned to run on CPU. During partitioning, additional operators, such as aten::size, are also sent to CPU. However, there’s an issue in our partitioner where the input batch sizes are not properly recorded at inference time when aten::size operators are sent to CPU. This is a known issue and is being worked on.
It’s likely that the easiest way to fix this issue is to compile aten::embedding instead of partitioning it to CPU. This can be accomplished by using the fallback=False flag during compilation to compile all operators in your model:
traced = torch_neuron.trace(model, example, dynamic_batch_size=True, fallback=False)
Note that compiling embedding operators on Neuron tends to work when the embedding table is relatively small. Embedding operator support is disabled by default because there are many cases where moving this to device causes worse performance.
Can you retry compiling your model using fallback=False to see if it resolves the error you’re seeing? It’s possible that using fallback=False will cause compilation errors if your model contains additional types of unsupported operators. Can you provide your compilation logs if you hit errors when you use fallback=False? We can look at these to see if there are any other problematic operators.
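For reference, the full compilation step might then look roughly like the following sketch (the model name, MAX_LENGTH, and dummy inputs are illustrative placeholders, not your exact code):

import torch_neuron
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MAX_LENGTH = 512  # illustrative; use the sequence length you serve at inference time
model_id = "cardiffnlp/roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)
model.eval()

dummy_input = ["a dummy sentence"] * 4  # batch size used for tracing
embeddings = tokenizer(
    dummy_input,
    max_length=MAX_LENGTH,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
example = (embeddings["input_ids"], embeddings["attention_mask"])

# fallback=False compiles all operators (including aten::embedding) on Neuron
# instead of partitioning unsupported ones to CPU.
traced = torch_neuron.trace(model, example, dynamic_batch_size=True, fallback=False)
traced.save("model_neuron.pt")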
Thanks Jeff! I'll do that and get back to you.
To clarify - should we also still be setting dynamic_batch_size=True when compiling and using torch.neuron.DataParallel at runtime?
That worked without issue. I'll leave this open to give you the opportunity to answer my above query around dynamic_batch_size; afterwards, feel free to close this.
Thanks for your help, it was extremely useful.
Glad to hear you were able to fix the issues you were encountering!
torch_neuron.DataParallel automatically enables dynamic batching, so there is no need to set dynamic_batch_size=True when you compile your model. You can learn more about how torch_neuron.DataParallel uses dynamic batching in this documentation.
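Concretely, the runtime side can then look something like this sketch (paths and names are illustrative; adapt it to your model_fn/predict_fn):

import torch
import torch_neuron
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/roberta-base-sentiment")
model = torch.jit.load("model_neuron.pt")
model_parallel = torch_neuron.DataParallel(model)  # dynamic batching is enabled automatically

inputs = ["an example sentence", "another example sentence"]  # any batch size
embeddings = tokenizer(
    inputs,
    max_length=512,  # must match the compile-time sequence length
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
logits = model_parallel(embeddings["input_ids"], embeddings["attention_mask"])[0]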
We will close this issue because you were able to resolve the issues in this ticket. Please feel free to open a new ticket if you encounter any other issues using Neuron in the future.
Hello, we're experimenting with AWS Neuron and inf1 instances and ran into some issues when trying to get batching to work.
The model we're using is a variant of https://huggingface.co/cardiffnlp/roberta-base-sentiment.
We're running on AWS Sagemaker; this is the inference.py file we're using on Sagemaker. When invoking the Sagemaker endpoint with any batch size other than the compiled size (4) in the inputs list, we see the following error. I was expecting this to work transparently with other batch sizes (1, 8, 16, etc.) based on the documentation for torch.neuron.DataParallel.
Versions - we're compiling with these versions
Runtime - we're using this docker image on Sagemaker
This is the gist of the compilation code:
I'm happy to share more of the compilation code if it would help.
Any suggestions are welcome. It's quite possible we're doing something entirely ridiculous here and expecting it to work.