aws-neuron / aws-neuron-sdk


Varying batch sizes results in errors #613

Closed Limess closed 1 year ago

Limess commented 1 year ago

Hello, we're experimenting with AWS Neuron and Inf1 instances and ran into some issues when trying to get batching to work.

The model we're using is a variant of https://huggingface.co/cardiffnlp/roberta-base-sentiment.

We're running on AWS SageMaker; this is the inference.py file we're using:

# based on inference.py in https://huggingface.co/blog/bert-inferentia-sagemaker
# we need to supply a predict_fn as sagemaker does not support zero-code deployments with AWS inferentia

import os

import torch
import torch.neuron
from transformers import AutoConfig, AutoTokenizer

# To use one neuron core per worker
os.environ["NEURON_RT_NUM_CORES"] = "1"

# saved weights name
AWS_NEURON_TRACED_WEIGHTS_NAME = "neuron_model.pt"

def model_fn(model_dir):
    # load tokenizer and neuron model from model_dir
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME))
    model_config = AutoConfig.from_pretrained(
        model_dir,
        # we have somehow not set this in distilroberta_wo_target so instead we use a custom one
        id2label={0: "negative", 1: "neutral", 2: "positive"},
    )

    # https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/torch-neuron/torch-neuron-dataparallel-app-note.html
    model_parallel = torch.neuron.DataParallel(model)

    return model_parallel, tokenizer, model_config

def predict_fn(data, model_tokenizer_model_config):
    # destruct model, tokenizer and model config
    model, tokenizer, model_config = model_tokenizer_model_config

    # create embeddings for inputs
    inputs = data.pop("inputs", data)
    parameters = data.pop("parameters", None)

    embeddings = tokenizer(
        inputs,
        return_tensors="pt",
        max_length=model_config.traced_sequence_length,
        padding="longest",
        truncation="longest_first",
    )
    # convert to tuple for neuron model
    model_inputs = tuple(embeddings.values())

    # run prediction
    with torch.no_grad():
        if parameters is not None:
            logits = model(*model_inputs, **parameters)[0]
        else:
            logits = model(*model_inputs)[0]

        predictions = torch.nn.Softmax(dim=1)(logits)

    return [
        {
            "label": model_config.id2label[item.argmax().item()],
            "scores": {
                model_config.id2label[i]: score.item() for i, score in enumerate(item)
            },
        }
        for item in predictions
    ]

When invoking the SageMaker endpoint with any batch size other than the compiled batch size (4) in the inputs list, we see the following error. Based on the documentation for torch.neuron.DataParallel, I was expecting this to work transparently with other batch sizes (1, 8, 16, etc.).

The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_229.py", line 24, in forward
    _8 = torch.embedding(CONSTANTS.c5, _5, 1)
    model = _NeuronGraph_60.model
    _9 = ops.neuron.forward_v2_1([argument_2, _6, _7, _8], model)
         ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return (_9,)
RuntimeError:
    Incorrect tensor shape at input tensor #0: received 4 33, expected 4 512.
    Incorrect tensor shape at input tensor #1: received 4 33 768, expected 4 512 768.
    Incorrect tensor shape at input tensor #3: received 4 33 768, expected 4 512 768.

Versions - we're compiling with:

protobuf==3.20.1
torch-neuron==1.12.1.2.5.8.0
neuron-cc[tensorflow]==1.13.5.0+7dcf000a6
transformers==4.24.0

Runtime - we're using this Docker image on SageMaker:

763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-inference-neuron:1.10.2-transformers4.20.1-neuron-py37-sdk1.19.1-ubuntu18.04

This is the gist of the compilation code:

import torch
import torch.neuron
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MAX_LENGTH = 512
BATCH_SIZE = 4

tokenizer = AutoTokenizer.from_pretrained(PATH_TO_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    PATH_TO_MODEL, torchscript=True
)

# create a dummy batch of BATCH_SIZE inputs
dummy_input = ["dummy input which will be padded later"] * BATCH_SIZE

embeddings = tokenizer(
    dummy_input,
    max_length=MAX_LENGTH,
    padding="max_length",
    return_tensors="pt",
    truncation=True,
)
neuron_inputs = tuple(embeddings.values())

# compile model with torch.neuron.trace and update config
model_neuron = torch.neuron.trace(
    model,
    neuron_inputs,
    verbose=1,
)

I'm happy to share more of the compilation code if it would help.

Any suggestions are welcome. It's quite possible we're doing something entirely ridiculous here and expecting it to work.

hannanjgaws commented 1 year ago

Hi @Limess:

Based on the error you’re seeing, it looks like the compile-time sequence length (512) is not the same as the sequence length you’re using at inference time (33) in predict_fn. The compile-time sequence length must match the inference-time sequence length on Neuron. (The exception is if you use torch_neuron.DataParallel with dim=1, but I’m not sure you want to do this for your application, because it means you’ll be breaking up a sequence of words.)

Can you make sure that the sequence length of your inference-time inputs in predict_fn matches the sequence length you’re using for compilation? You’ll also want to make sure that the padding and truncation settings are the same at compilation time and at inference time. Based on your compilation tokenizer and runtime tokenizer, it looks like this is not the case:

Compilation Time:

embeddings = tokenizer(
    dummy_input,
    max_length=MAX_LENGTH,
    padding="max_length",
    return_tensors="pt",
    truncation=True,
)

Inference Time:

embeddings = tokenizer(
    inputs,
    return_tensors="pt",
    max_length=model_config.traced_sequence_length,
    padding="longest",
    truncation="longest_first",
)
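A minimal sketch of an inference-time call aligned with the compile-time settings (assuming model_config.traced_sequence_length holds the compile-time max length of 512):

embeddings = tokenizer(
    inputs,
    return_tensors="pt",
    max_length=model_config.traced_sequence_length,  # 512, same as compilation
    padding="max_length",  # pad every batch to the full traced length
    truncation=True,       # same truncation behaviour as at compile time
)
model_inputs = tuple(embeddings.values())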
Limess commented 1 year ago

Thanks, I've updated the code to ensure it pads up to the max length, and now receive a different error:

Prediction error
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 234, in handle
    response = self.transform_fn(self.model, input_data, content_type, accept)
  File "/opt/conda/lib/python3.7/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 190, in transform_fn
    predictions = self.predict(processed_data, model)
  File "/.sagemaker/mms/models/model/inference.py", line 62, in predict_fn
    logits = model(*model_inputs)[0]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch_neuron/data_parallel.py", line 223, in forward
    return self.loaded_modules[0](*inputs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "code/__torch__/torch_neuron/runtime/___torch_mangle_229.py", line 24, in forward
    _8 = torch.embedding(CONSTANTS.c5, _5, 1)
    model = _NeuronGraph_60.model
    _9 = ops.neuron.forward_v2_1([argument_2, _6, _7, _8], model)
         ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    return (_9,)
RuntimeError: Inconsistent batch sizes found on inputs. All batch tensors must have the same dim 0 size.
Input tensor #0 shape: 16 512
Input tensor #1 shape: 16 512 768
Input tensor #2 shape: 4 512 768
Input tensor #3 shape: 16 512 768

I've logged the returned keys/values from the tokenizer and I don't understand how these correspond to the input tensors which are reporting errors:

Tokenized output keys
dict_keys(['input_ids', 'attention_mask'])

Tokenized output shape: 
[[16, 512], [16, 512]]
jeffhataws commented 1 year ago

@Limess , thank you again for filing the issue.

We were able to reproduce a similar error to the one you’re seeing using the open source cardiffnlp/roberta-base-sentiment model. The error you’re seeing is likely caused by an issue in our partitioner. Models that contain unsupported operators, such as aten::embedding, are partitioned to run on CPU. During partitioning, additional operators, such as aten::size, are also sent to CPU. However, there’s an issue in our partitioner where the input batch sizes are not properly recorded at inference time when aten::size operators are sent to CPU. This is a known issue and is being worked on.

It’s likely that the easiest way to fix this issue is to compile aten::embedding instead of partitioning it to CPU. This can be accomplished by using the fallback=False flag during compilation to compile all operators in your model:

traced = torch_neuron.trace(model, example, dynamic_batch_size=True, fallback=False)

Note that compiling embedding operators on Neuron tends to work when the embedding table is relatively small. Embedding operator support is disabled by default because there are many cases where moving this to device causes worse performance.

Can you retry compiling your model using fallback=False to see if it resolves the error you’re seeing? It’s possible that using fallback=False will cause compilation errors if your model contains additional types of unsupported operators. If you hit errors when using fallback=False, can you provide your compilation logs? We can look at these to see if there are any other problematic operators.
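Applied to the compilation snippet earlier in this thread, the retry would look roughly like this (a sketch only; it keeps the dynamic_batch_size=True flag from the example above, and whether that flag is needed alongside DataParallel is clarified further down):

model_neuron = torch.neuron.trace(
    model,
    neuron_inputs,
    fallback=False,           # compile all operators, including aten::embedding
    dynamic_batch_size=True,  # as in the example above; see the DataParallel note below
    verbose=1,
)

# save under the name that model_fn loads at inference time
model_neuron.save("neuron_model.pt")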

Limess commented 1 year ago

Thanks Jeff! I'll do that and get back to you.

To clarify - should we also still be setting dynamic_batch_size=True when compiling and using torch.neuron.DataParallel at runtime?

Limess commented 1 year ago

That worked without issue. I'll leave this open to give you the opportunity to answer my query above around dynamic_batch_size; afterwards, feel free to close this.

Thanks for your help, it was extremely useful.

hannanjgaws commented 1 year ago

Glad to hear you were able to fix the issues you were encountering!

torch_neuron.DataParallel automatically enables dynamic batching, so there is no need to set dynamic_batch_size=True when you compile your model. You can learn more about how torch_neuron.DataParallel uses dynamic batching in this documentation.
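Put together, a condensed sketch of the compile/inference split under this guidance (reusing the names from the snippets above):

# compile time: no dynamic_batch_size flag required
model_neuron = torch.neuron.trace(model, neuron_inputs, fallback=False)
model_neuron.save("neuron_model.pt")

# inference time: DataParallel enables dynamic batching automatically,
# so batch sizes other than the traced BATCH_SIZE are accepted
model = torch.jit.load("neuron_model.pt")
model_parallel = torch.neuron.DataParallel(model)
with torch.no_grad():
    logits = model_parallel(*model_inputs)[0]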

We will close this issue because you were able to resolve the issues in this ticket. Please feel free to open a new ticket if you encounter any other issues using Neuron in the future.