aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Tutorial Compiling and Deploying HuggingFace Pretrained BERT is not working #138

Closed. djstrong closed this issue 4 years ago

djstrong commented 4 years ago

I have tried the original tutorial (https://github.com/aws/aws-neuron-sdk/blob/master/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.ipynb):

$ python3 -m pip install torch-neuron neuron-cc[tensorflow] transformers --upgrade --extra-index-url=https://pip.repos.neuron.amazonaws.com
...
ERROR: torchvision 0.4.2 requires torch==1.3.1, which is not installed.
ERROR: tensorflow-neuron 1.15.2.1.0.1796.0 requires tensorboard-neuron<1.16.0,>=1.15.0, which is not installed.
...
Successfully installed dmlc-nnvm-1.0.2732.0+6020726378 dmlc-topi-1.0.2732.0+6020726378 dmlc-tvm-1.0.2732.0+6020726378 inferentia-hwm-1.0.1516.0+6020726378 markdown-3.2.2 neuron-cc-1.0.16861.0+6021080341 numpy-1.18.2 tensorboard-1.15.0 tensorflow-1.15.0 torch-neuron-1.0.1386.0
$ python3 test_temp.py
Traceback (most recent call last):
  File "test_temp.py", line 4, in <module>
    import torch.neuron
  File "/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/neuron/__init__.py", line 1, in <module>
    from torch_neuron import *
  File "/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/__init__.py", line 7, in <module>
    from .convert import trace
  File "/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py", line 7, in <module>
    from torch_neuron import graph
  File "/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py", line 4, in <module>
    from torch_neuron.resolve_function import get_function
  File "/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/resolve_function.py", line 7, in <module>
    from torch_neuron import tensor_info
  File "/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/tensor_info.py", line 84, in <module>
    3: torch.channels_last_3d,
AttributeError: module 'torch' has no attribute 'channels_last_3d'
aws-diamant commented 4 years ago

Hello djstrong,

Thanks for posting your question. This seems like a dependency issue based on the pip install errors you posted.

We recommend using a Python virtual environment for simpler package management. I'd recommend following the instructions in https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-install-guide.md and then trying the pip install again. Please make sure to run it from within the virtual environment, as explained in that link.

You should not see any error reported after running the command below:

$ python3 -m pip install torch-neuron neuron-cc[tensorflow] transformers --upgrade --extra-index-url=https://pip.repos.neuron.amazonaws.com
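
Once that completes, a quick sanity check (just an illustrative snippet, not part of the tutorial) is to confirm that the imports which failed in your traceback now succeed:

# Illustrative sanity check: these imports should succeed without the
# AttributeError from the original report.
import torch
import torch.neuron  # provided by the torch-neuron package
print(torch.__version__)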

Please let us know whether this resolves the issue; we'd be happy to assist further. Once resolved, we will update the tutorial for better clarity going forward.

/Ron

djstrong commented 4 years ago

Thank you.

I have used "AWS Deep Learning AMI (Ubuntu 18.04)" and an environment prepared by AWS: "for PyTorch (+AWS Neuron) with Python3" source activate aws_neuron_pytorch_p36.

Right now I don't have access to AWS, but I will try on another machine.

aws-diamant commented 4 years ago

Thanks for sharing this information, djstrong!

I was able to reproduce the issue that you're reporting after starting from source activate aws_neuron_pytorch_p36. We're looking into this, and will update as soon as we resolve the issue.

As a temporary workaround, I suggest that you skip the source activate aws_neuron_pytorch_p36 step, and follow the instructions in https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-install-guide.md to get the most up-to-date Neuron packages.

djstrong commented 4 years ago

Thank you, the workaround works: it saves the compiled model.

djstrong commented 4 years ago

Inference from the tutorial also works. However, the inference time is 3 times longer than on an NVIDIA T4.

aws-diamant commented 4 years ago

That's great to hear, djstrong!

Now we're getting to the fun part :) With this BERTBase intro tutorial we only got the model to compile; we have not yet done any performance tuning. First-level performance improvement opportunities are listed below.

Use all 4 NeuronCores: In this tutorial we are only using a single NeuronCore out of the 4 NeuronCores available in Inferentia, so we should enable all 4 cores to get better parallelism. You can review the following link for relevant info: https://github.com/aws/aws-neuron-sdk/blob/master/src/examples/pytorch/bert_tutorial/neuron_bert_mrpc_benchmark.ipynb. I also attached some code snippets at the bottom of my reply showing how to modify the tutorial code to use multiple NeuronCores; you can adjust num_neuron_cores to check the impact of adding more cores on performance.

Batching: Batching allows the NeuronCores to read the weights once, cache them, and use them for multiple inferences, thus increasing overall performance. Neuron can typically achieve near-max performance at low batch sizes (2-8), with minimal impact on latency. We definitely encourage you to try a couple of different batch sizes and find what works best for your use-case. From our experience, BERT models can see significant performance improvements by choosing the optimal batch size.
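
As a rough sketch of what batching looks like at compile time (illustrative only; it assumes the tokenizer, sequence_0/sequence_2 and model objects from the tutorial, and batch_size is just an example value):

batch_size = 4
# Build a single example and repeat it along the batch dimension
paraphrase_b = tokenizer.encode_plus(sequence_0, sequence_2, max_length=128, pad_to_max_length=True, return_tensors='pt')
example_inputs_batch = (paraphrase_b['input_ids'].repeat(batch_size, 1),
                        paraphrase_b['attention_mask'].repeat(batch_size, 1),
                        paraphrase_b['token_type_ids'].repeat(batch_size, 1))
# Trace with the batched example so the compiled model expects batch_size inputs
model_batch = torch.neuron.trace(model, example_inputs_batch, compiler_args=['-O2'])
model_batch.save('bert_neuron_b{}.pt'.format(batch_size))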

bert_compile.py:
-----------------
...
# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']
example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']

# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron, using optimization level -O2
num_neuron_cores = 4
model_names = ['bert_neuron0.pt', 'bert_neuron1.pt', 'bert_neuron2.pt', 'bert_neuron3.pt']
for i in range(num_neuron_cores):
    # Trace the original (un-traced) model each time, saving one compiled copy per NeuronCore
    model_neuron = torch.neuron.trace(model, example_inputs_paraphrase, compiler_args=['-O2'])
    model_neuron.save(model_names[i])

bert_infer.py:
--------------
import time
import os
from concurrent import futures
...
# Load TorchScript back
model_files = ('bert_neuron0.pt', 'bert_neuron1.pt', 'bert_neuron2.pt', 'bert_neuron3.pt')
models = []
for m in model_files:
    model = torch.jit.load(m)
    models.append(model)

# Configure four NeuronCore groups of one core each, so each loaded model gets its own core
os.environ['NEURONCORE_GROUP_SIZES'] = '1,1,1,1'
os.environ['NEURON_MAX_NUM_INFERS'] = '-1'
...
classes = ['not paraphrase', 'paraphrase']
num_infer = 300
num_neuron_cores = 4
executor = futures.ThreadPoolExecutor(max_workers=num_neuron_cores)
fut_list = []

start = time.time()
for i in range(num_infer):
    for k in range(num_neuron_cores):
        fut = executor.submit(models[k], *example_inputs_paraphrase)
        fut_list.append(fut)

result_list = [fut.result() for fut in fut_list]
end = time.time()
total_time = (end-start) / (num_infer*num_neuron_cores)
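
For completeness, the measured average could be reported at the end (this line is not in the original snippet):

# Report the average wall-clock time per inference across all cores
print('average time per inference: {:.2f} ms'.format(total_time * 1000.0))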

Lastly, please note that more BERT performance improvements are coming in the next release (August), along with automated NeuronCore parallelism, aiming to make Neuron both more performant and easier to use.

/Ron

djstrong commented 4 years ago

Thank you! So we will probably revisit this in August. We are interested in the latency of inference on a single text (with an average length of 512 subtokens).

  1. Are all NeuronCores independent, or can they cooperate on a single inference? For roberta-base I got about 35 ms per inference, while on an NVIDIA T4 it is 11 ms.
  2. Are the model files 'bert_neuron0.pt', 'bert_neuron1.pt' and so on different (diff bert_neuron0.pt bert_neuron1.pt)? Can we load a different model onto each NeuronCore?
  3. I have problems with inference at batch sizes greater than one; the error was something about a wrong input. Also, it was not possible to send shorter text (without padding to the max sequence length of 512). On a GPU this is possible, and inference is faster.
aws-diamant commented 4 years ago

Yes, the upcoming August release should be very useful for this use-case. Just to give you an early data point, I did a quick benchmark for BERTBase-seqlen512-batch1 using the upcoming Aug release, and I'm seeing <10 ms per inference. If time is of the essence, please let me know, and I can try to facilitate an early private preview.

To address your questions:

  1. Yes, the NeuronCores can either be independent or cooperate to compute a single inference (https://github.com/aws/aws-neuron-sdk/blob/master/docs/technotes/neuroncore-pipeline.md).
  2. Yes, you can load the same or different models onto the different NeuronCores. Some possible use-cases for different models are multi-NN pipelines and majority voting.
  3. BERTBase-seqlen512 should achieve near-maximal performance at batch=1, so it should be fine to continue without batching. There is indeed a way to shorten the computation time for shorter sentences: currently Neuron achieves this by compiling for multiple sequence lengths (512/256/128/64) and using the best-fit sequence length at runtime.

/Ron

djstrong commented 4 years ago

Thank you, it is helpful.

3. Neuron achieves this by compiling for multiple sequence lengths (512/256/128/64), and using the best-fit sequence length at runtime.

So are there 4 separate models, or one model supporting 4 sequence lengths?

We are also interested in batching, because some texts may be longer than 512 tokens and are then split into a couple of fragments. Does batching also require compiling for multiple batch sizes?

Could you tell us how much memory bert-base uses with one NeuronCore and with 4 NeuronCores (your example), and how much memory AWS Inferentia has?

aws-diamant commented 4 years ago

There's a single model with variable sequence-length and batch-size.

The entry point would be to compile the model only once, for a fixed sequence length (512) and a fixed batch size (1), and then pad the inputs as needed (note: that's what I did in order to share the expected Aug release performance numbers with you). This alone should achieve <10 ms per inference, and best-in-class cost per inference.
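
As a minimal sketch of what that padding looks like with the Hugging Face tokenizer (illustrative only; it assumes the tokenizer from the tutorial, a compiled model loaded as model_neuron, and text_a/text_b as placeholder inputs):

# Pad every input to the fixed shape the model was compiled for (seqlen 512, batch 1)
inputs = tokenizer.encode_plus(text_a, text_b, max_length=512, truncation=True, pad_to_max_length=True, return_tensors='pt')
outputs = model_neuron(inputs['input_ids'], inputs['attention_mask'], inputs['token_type_ids'])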

From here, you can optionally apply a set of incremental optimizations that drive the performance higher and thus cost-per-inference lower (sort of 'expert-mode' optimizations):

(note: the steps below are optional)

  1. Variable sequence length: Compile the model multiple times, with different sequence lengths (ahead of time), load the compiled models into device memory, and then choose at runtime which compiled model to deploy for each input (a rough sketch follows after this list).

  2. Variable batch size: You could apply a similar trick for batch-size as well. Some configurations of BERT (e.g. seqlen 256/512) are quite efficient for batch=1, so the benefit from this step is use-case dependent.

  3. Optimize operators: There's room for optimizing the implementation of some operators (e.g. GELU) without impacting accuracy, to get even more (quite noticeable) performance uplifts. We will share more information on this in upcoming releases.
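
A rough sketch of step 1 (illustrative only; it reuses the tokenizer, sequence_0/sequence_2 and model objects from the tutorial, and the helper name best_fit_length is hypothetical):

seq_lengths = [64, 128, 256, 512]
compiled = {}
for seq_len in seq_lengths:
    # Compile one variant per sequence length, ahead of time
    ex = tokenizer.encode_plus(sequence_0, sequence_2, max_length=seq_len, pad_to_max_length=True, return_tensors='pt')
    ex_inputs = ex['input_ids'], ex['attention_mask'], ex['token_type_ids']
    compiled[seq_len] = torch.neuron.trace(model, ex_inputs, compiler_args=['-O2'])
    compiled[seq_len].save('bert_neuron_seq{}.pt'.format(seq_len))

# At runtime: pick the smallest compiled length that fits the input, pad to it, and run that variant
def best_fit_length(num_tokens):
    return min((length for length in seq_lengths if length >= num_tokens), default=max(seq_lengths))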

Note that all these steps are optional: you can definitely keep the single-compilation method and get best-in-class cost per inference (plus benefit from future Neuron performance improvements). But you also have a path to drive cost per inference down as far as possible with these more advanced optimization techniques.

Hope this clarifies it, /Ron

djstrong commented 4 years ago

I'm confused about the number of models (loaded into device memory) in the variable sequence-length / batch-size scenario:

There's a single model with variable sequence-length and batch-size.

  1. Compile the model multiple times, with different sequence lengths (ahead of time), load the compiled models to device memory

How many compiled models with different sequence lengths will fit in device memory (per NeuronCore?)?

aws-diamant commented 4 years ago

The Inferentia device consists of 4 NeuronCores, and has 8GB of device DRAM (which is shared across the 4 cores). So we could fit quite a few compiled models in the device DRAM (BERTBase is ~100M parameters).
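
As a rough back-of-the-envelope check (illustrative numbers only; it ignores compiled-code and runtime buffers, which reduce the usable space in practice):

params = 100e6            # BERTBase, ~100M parameters
bytes_per_param = 4       # assuming FP32 weights; fewer bytes if cast to BF16/FP16
per_copy_gb = params * bytes_per_param / 1e9   # ~0.4 GB per compiled copy
device_dram_gb = 8
print(int(device_dram_gb // per_copy_gb))      # on the order of 20 copies, before overheads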

djstrong commented 4 years ago

Thank you for your help!

Our use-case is more complicated, as we are using a few different models at once. What is more, we are using large versions of BERT (not base), so they will not all fit in DRAM with several sequence-length variants each. Is it technically possible that, in the future, Inferentia will support variable sequence lengths and batch sizes with one compiled model?

aws-diamant commented 4 years ago

Hey djstrong,

Yes, Inferentia is capable of supporting variable sequence lengths / batch sizes with one compiled model. I discussed this with the team, and we will be releasing this capability in one of the next Neuron releases. Once we have a more accurate ETA we will update our roadmap page and notify you. In the meantime, I would recommend that you pad the inputs to the maximal sequence length supported by the model. You should be able to achieve best-in-class cost per inference even with the padding, and then we'll improve on top of that once we release the variable sequence-length capability.

Btw, if you could share with us more details regarding your expected configuration and performance requirements, we might be able to share more direct recommendations. Feel free to send this off-thread to diamant@amazon.com.

/Ron

AWSGH commented 4 years ago

Hi djstrong: we haven't heard from you in a while, so we'll go ahead and close this issue; feel free to reopen if needed.

karthikgali commented 4 years ago

Hi,

We are testing the Pegasus model for summarization (https://huggingface.co/sshleifer/distill-pegasus-cnn-16-4). For inference, we are planning to use an EC2 Inf1 instance. I have already tried g4dn.xlarge, and it takes around 1.6 seconds to run inference on our documents.

Could you please confirm whether we can achieve better performance for the above inference using an EC2 Inf1 instance?

Please let me know,

Regards, Karthik

AWSGH commented 4 years ago

Hi karthikgali,

The Inf1 instances outperform the g4 family on many models, spanning computer vision, NLP and speech use-cases. While we haven't specifically tried this model, our general experience with transformer-based models is that Inf1 will provide around 30% higher throughput, and half the cost per inference, compared to G4.

I suggest you try this specific model on Inf1; we are here to help in case you have additional questions.

Best regards, Gadi

karthikgali commented 4 years ago

Hi,

Thanks @AWSGH

I used the following code to compile the model; it compiled successfully, although with a lot of warnings:

import tensorflow
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Build tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distill-pegasus-cnn-16-4")
model = AutoModelForSequenceClassification.from_pretrained("sshleifer/distill-pegasus-cnn-16-4")

example_txt = [
    'this is a sentence in english that we want to summarize',
]

encode_example = tokenizer.encode_plus(example_txt[0],truncation=True, padding='longest',  max_length=1024, pad_to_max_length=True, return_tensors='pt')
example_inputs = encode_example['input_ids'], encode_example['attention_mask'] #, encode_example['token_type_ids']
model_neuron = torch.neuron.trace(model, example_inputs, compiler_args=['-O2'])
model_neuron.save('pegasus_neuron.pt')

When I try to run inference on a sentence, I get an error.

Inference code:

from transformers import AutoTokenizer, AutoModelWithLMHead

import os
import torch
import torch_neuron

class DistillSummarizeLarge:
    def __init__(self):
        self.device = Config.device
        _cache = Config.mount
        self.tokenizer = AutoTokenizer.from_pretrained(
            "sshleifer/distill-pegasus-cnn-16-4", cache_dir=_cache)
        self.model = torch.jit.load('pegasus_neuron.pt')
        #self.model.eval()

    def __call__(self, text):
        batch = self.tokenizer.encode_plus(
            text, truncation=True, padding='longest',return_tensors="pt")

        batch_new = batch['input_ids'], batch['attention_mask']
        translated = self.model(*batch_new)
        tgt_text = self.tokenizer.batch_decode(
            translated, skip_special_tokens=True)
        return tgt_text

summarizer = DistillSummarizeLarge()
print(summarizer('''It indicates a way to close an interaction, or dismiss a notification. Good Subscriber Account active since Edit my Account Premium Articles Upgrade Membership Email Preferences My Subscription FAQs Logout DOW S&P 500 NASDAQ 100 Close icon Two crossed lines that form an 'X'. It indicates a way to close an interaction, or dismiss a notification. Streaming Software & Apps Smart Home Smartphones Laptops & Tablets Gaming Gadgets More Button Icon Circle with three vertical dots'''))

I am getting the below error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-6-e49367b7b011> in <module>
      1 summarizer = DistillSummarizeLarge()
      2 
----> 3 print(summarizer('''It indicates a way to close an interaction, or dismiss a notification. Good Subscriber Account active since Edit my Account Premium Articles Upgrade Membership Email Preferences My Subscription FAQs Logout DOW S&P 500 NASDAQ 100 Close icon Two crossed lines that form an 'X'. It indicates a way to close an interaction, or dismiss a notification. Streaming Software & Apps Smart Home Smartphones Laptops & Tablets Gaming Gadgets More Button Icon Circle with three vertical dots'''))

<ipython-input-2-ac180eb64b6e> in __call__(self, text)
     27 
     28         batch_new = batch['input_ids'], batch['attention_mask']
---> 29         translated = self.model(*batch_new)
     30         tgt_text = self.tokenizer.batch_decode(
     31             translated, skip_special_tokens=True)

~/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch_neuron/convert.py", line 74, in forward
    _58 = torch.slice(tensor6, 0, 0, 9223372036854775807, 1)
    tensor7 = torch.slice(_58, 1, 1, 9223372036854775807, 1)
    tensor8 = torch.view(tensor5, [13])
              ~~~~~~~~~~ <--- HERE
    _59 = torch.expand(tensor8, [1, 13], implicit=True)
    _60 = torch.copy_(tensor7, _59, False)

Traceback of TorchScript, original code (most recent call last):
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/resolve_function.py(49): func
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py(195): __call__
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py(85): run_op
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py(74): __call__
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py(106): forward
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py(534): _slow_forward
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py(548): __call__
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/jit/__init__.py(1027): trace_module
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/jit/__init__.py(875): trace
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py(87): trace
compile.py(24): <module>
RuntimeError: shape '[13]' is invalid for input of size 87

Could you please look into this and help me resolve the issue?

karthikgali commented 4 years ago

@AWSGH any update on this? Please let us know.

aws-joshim commented 4 years ago

@karthikgali I have created a new GitHub issue, https://github.com/aws/aws-neuron-sdk/issues/182, to track the inference issue that you reported. We are looking into it and will get back to you shortly; updates will be posted directly on https://github.com/aws/aws-neuron-sdk/issues/182.

aws-joshim commented 4 years ago

@karthikgali I have posted an update on https://github.com/aws/aws-neuron-sdk/issues/182 about the specific issue that you reported. AWS Neuron compiles for fixed tensor sizes; at the moment, the model that you are compiling uses varying tensor shapes. Sample code to address this is detailed on https://github.com/aws/aws-neuron-sdk/issues/182.
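
For reference, the general idea (a generic sketch, not the exact sample from issue #182) is to pad inference inputs to the same fixed shape used at compile time, rather than using padding='longest':

# In DistillSummarizeLarge.__call__: pad to the shape used during compilation (max_length=1024)
batch = self.tokenizer.encode_plus(text, truncation=True, max_length=1024, pad_to_max_length=True, return_tensors='pt')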