Closed: djstrong closed this issue 4 years ago.
Hello djstrong,
Thanks for posting your question. This seems like a dependency issue based on the pip install errors you posted.
We recommend using a Python virtual environment for simpler package management. I'd recommend following the instructions in https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-install-guide.md and then trying the pip install again. Please make sure to run it from the virtual environment, as explained in that link.
You should not see any error reported after running the command below:
$ python3 -m pip install torch-neuron neuron-cc[tensorflow] transformers --upgrade --extra-index-url=https://pip.repos.neuron.amazonaws.com
Please let us know if the issue is resolved, we’d be happy to assist further. Once resolved, we will update the tutorial for better clarity going forward.
/Ron
Thank you.
I have used the "AWS Deep Learning AMI (Ubuntu 18.04)" and the environment prepared by AWS "for PyTorch (+AWS Neuron) with Python3": source activate aws_neuron_pytorch_p36.
Right now I don't have access to AWS, but I will try on another machine.
Thanks for sharing this information, djstrong!
I was able to reproduce the issue that you're reporting after starting from source activate aws_neuron_pytorch_p36.
We're looking into this, and will update as soon as we resolve the issue.
As a temporary workaround, I suggest that you skip the source activate aws_neuron_pytorch_p36 step, and instead follow the instructions in https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-install-guide.md to get the most up-to-date Neuron packages.
Thank you, the workaround works - it saves the compiled model.
Inference from the tutorial also works. However, the inference time is 3 times longer than on an NVIDIA T4.
That's great to hear, djstrong!
Now we're getting to the fun part :) This BERTBase intro tutorial only gets the model to compile; we have not yet done any performance tuning. First-level performance improvement opportunities are listed below.
Use all 4 NeuronCores:
In this tutorial, we are only using a single NeuronCore, out of the 4 NeuronCores available in Inferentia.
So we should enable all 4 cores, to get better parallelism. You can review the following link for relevant info: https://github.com/aws/aws-neuron-sdk/blob/master/src/examples/pytorch/bert_tutorial/neuron_bert_mrpc_benchmark.ipynb.
I also attached some code snippets at the bottom of my reply, showing how to modify the tutorial code to use multiple NeuronCores. You can adjust num_neuron_cores to check the impact of adding more cores on performance.
Batching: Batching allows the NeuronCores to read the weights once, cache them, and use them for multiple inferences, thus increasing overall performance. Neuron can typically achieve near-maximum performance at low batch sizes (2-8), with minimal impact on latency. We definitely encourage you to try a couple of different batch sizes and find what works best for your use-case. From our experience, BERT models can see significant performance improvements by choosing the optimal batch size (see the short sketch after the snippets below).
bert_compile.py:
-----------------
...
# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']
example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']

# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron,
# using optimization level -O2. One compiled copy is saved per NeuronCore.
num_neuron_cores = 4
model_names = ['bert_neuron0.pt', 'bert_neuron1.pt', 'bert_neuron2.pt', 'bert_neuron3.pt']
models = []
for i in range(num_neuron_cores):
    if i == 0:
        model_neuron = torch.neuron.trace(model, example_inputs_paraphrase, compiler_args=['-O2'])
    else:
        # compiler_workdir is assumed to be defined earlier, in the elided part of the script
        model_neuron = torch.neuron.trace(model, example_inputs_paraphrase, compiler_args=['-O2'],
                                          compiler_workdir=compiler_workdir, verbose=1)
    models.append(model_neuron)
    model_neuron.save(model_names[i])
bert_infer.py:
--------------
import time
import os
from concurrent import futures
...
# Load the compiled TorchScript models back
model_files = ('bert_neuron0.pt', 'bert_neuron1.pt', 'bert_neuron2.pt', 'bert_neuron3.pt')
models = []
for m in model_files:
    model = torch.jit.load(m)
    models.append(model)

# Configure one NeuronCore group per core, so each loaded model runs on its own NeuronCore
os.environ['NEURONCORE_GROUP_SIZES'] = '1,1,1,1'
os.environ['NEURON_MAX_NUM_INFERS'] = '-1'
...
classes = ['not paraphrase', 'paraphrase']
num_infer = 300
num_neuron_cores = 4

# Submit inferences to all 4 models in parallel using a thread pool
executor = futures.ThreadPoolExecutor(max_workers=num_neuron_cores)
fut_list = []
start = time.time()
for i in range(num_infer):
    for k in range(num_neuron_cores):
        fut = executor.submit(models[k], *example_inputs_paraphrase)
        fut_list.append(fut)
result_list = [fut.result() for fut in fut_list]
end = time.time()

# Average time per inference across all cores
total_time = (end - start) / (num_infer * num_neuron_cores)
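On the batching point above, here is a minimal illustrative sketch of how a batched example input could be built for compilation. It is not part of the tutorial; it reuses model and example_inputs_paraphrase from the bert_compile.py snippet, and the batch size of 4 is just an example value to measure against:

import torch
import torch.neuron

batch_size = 4  # example value; try a few sizes (2-8) and measure

# Stack the single-example tensors along the batch dimension to build a batched example
batched_inputs = tuple(torch.cat([t] * batch_size, dim=0)
                       for t in example_inputs_paraphrase)

# Compile once for this fixed batch size; at inference time, inputs must be padded
# (or grouped) to the same batch size the model was compiled with
model_neuron_batched = torch.neuron.trace(model, batched_inputs, compiler_args=['-O2'])
model_neuron_batched.save('bert_neuron_batch{}.pt'.format(batch_size))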
Lastly, please note that more BERT performance improvements are coming in the next release (August), along with automated NeuronCore parallelism, aiming to make Neuron both more performant and easier to use.
/Ron
Thank you! So we will probably revisit in August. We are interested in the latency for inference of one text (with an average length of 512 subtokens). Are the compiled models identical (diff bert_neuron0.pt bert_neuron1.pt)? Can we load different models to each NeuronCore?
To address your questions: yes, the upcoming August release should be very useful for this use-case. Just to give you an early data point, I ran a quick benchmark for BERTBase-seqlen512-batch1 using the upcoming August release, and I am seeing <10 ms per inference. If time is of the essence, please let me know, and I can try to facilitate an early private preview.
/Ron
Thank you, it is helpful.
3. Neuron achieves this by compiling for multiple sequence lengths (512/256/128/64), and using the best-fit sequence length at runtime.
So are there 4 separate compiled models, or one model supporting 4 sequence lengths?
We are also interested in batching, because some texts may be longer than 512 subtokens and are then split into a couple of fragments. Does batching also require compiling for multiple batch sizes?
Could you tell us how much memory bert-base uses on one NeuronCore and on 4 NeuronCores (your example), and how much memory AWS Inferentia has?
There's a single model with variable sequence-length and batch-size.
The entry point would be to compile the model only once, with a fixed sequence length (512) and a fixed batch size (1), and then pad the inputs as needed (note: that's what I did in order to share the expected August release performance numbers with you). This alone should achieve <10 ms per inference, and best-in-class cost per inference.
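As a minimal illustrative sketch of this padding approach (the names model_neuron and tokenizer are assumptions standing in for the tutorial's compiled model and tokenizer, and the tokenizer arguments mirror the style used in the snippets above):

max_length = 512  # the fixed sequence length the model was compiled with

# Tokenize and pad every input up to the compiled sequence length
inputs = tokenizer.encode_plus(
    "an example sentence to classify",
    truncation=True,
    max_length=max_length,
    pad_to_max_length=True,   # pad shorter inputs up to max_length
    return_tensors='pt')

# The padded tensors now match the shapes used at compile time
outputs = model_neuron(inputs['input_ids'], inputs['attention_mask'], inputs['token_type_ids'])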
From here, you can optionally apply a set of incremental optimizations that drive the performance higher and thus cost-per-inference lower (sort of 'expert-mode' optimizations):
(note: the steps below are optional)
Variable sequence length: Compile the model multiple times, with different sequence lengths (ahead of time), load the compiled models to device memory, and then choose at runtime which compiled model to use for each input (see the sketch below).
Variable batch size: You could apply a similar trick for batch-size as well. Some configurations of BERT (e.g. seqlen 256/512) are quite efficient for batch=1, so the benefit from this step is use-case dependent.
Optimize operators: There's room for optimizing the implementation of some operators (e.g. GELU) without impacting accuracy, to get even more (quite noticeable) performance uplifts. We will share more information on this in upcoming releases.
Note that all these steps are optional, you can definitely keep the single-compilation method and get best-in-class cost per inference (+ benefit from future Neuron performance improvements). But you also have a path to drive cost-per-inference down as much as possible with these more advanced optimization techniques.
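For illustration, here is a minimal sketch of the runtime model-selection idea from the "variable sequence length" item above. The file names, the tokenizer variable, and the infer helper are hypothetical; it assumes one model was compiled per sequence length as described:

import torch

# Hypothetical file names: one compiled model per sequence length, compiled ahead of time
compiled_models = {
    64:  torch.jit.load('bert_neuron_seq64.pt'),
    128: torch.jit.load('bert_neuron_seq128.pt'),
    256: torch.jit.load('bert_neuron_seq256.pt'),
    512: torch.jit.load('bert_neuron_seq512.pt'),
}

def infer(text):
    # Tokenize without padding first, to measure the real input length
    encoded = tokenizer.encode_plus(text, truncation=True, max_length=512, return_tensors='pt')
    length = encoded['input_ids'].shape[1]

    # Pick the smallest compiled sequence length that fits this input
    seq_len = min(s for s in compiled_models if s >= length)

    # Re-tokenize, padding up to the chosen compiled length
    padded = tokenizer.encode_plus(text, truncation=True, max_length=seq_len,
                                   pad_to_max_length=True, return_tensors='pt')
    model = compiled_models[seq_len]
    return model(padded['input_ids'], padded['attention_mask'], padded['token_type_ids'])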
Hope this clarifies it, /Ron
I'm confused about number of models (loaded to device memory) in scenario with variable sequence length/batch-size:
There's a single model with variable sequence-length and batch-size.
- Compile the model multiple times, with different sequence lengths (ahead of time), load the compiled models to device memory
How many compiled models with different sequence length will fit to device memory (NeuronCore?)?
The Inferentia device consists of 4 NeuronCores, and has 8GB of device DRAM (which is shared across the 4 cores). So we could fit quite a few compiled models in the device DRAM (BERTBase is ~100M parameters).
Thank you for your help!
Our use-case is more complicated, as we use a few different models at once. What is more, we use the large versions (not base) of BERT, so they will not fit in DRAM with several sequence lengths each. Is it technically possible that, in the future, Inferentia will support variable sequence lengths and batch sizes with one compiled model?
Hey djstrong,
Yes, Inferentia is capable of supporting variable sequence-length / batch-sizes with one compiled model. I discussed with the team, and we will be releasing this capability in one of the next Neuron releases. Once we have a more accurate ETA we will update our roadmap page, and notify you. In the meantime, I would recommend that you pad the inputs to the maximal sequence-length supported by the model. You should be able to achieve best-in-class cost-per-inference even with the padding, and then we'll improve on top of that once we release the variable sequence-length capability.
Btw, if you could share with us more details regarding your expected configuration and performance requirements, we might be able to share more direct recommendations. Feel free to send this off-thread to diamant@amazon.com.
/Ron
Hi djstrong: we haven't heard from you in a while, so we will go ahead and close this issue. Feel free to reopen if needed.
Hi,
We are testing the Pegasus model for summarization (https://huggingface.co/sshleifer/distill-pegasus-cnn-16-4). For inference, we are planning to use an EC2 Inf1 instance. I have already tried g4dn.xlarge, and it takes around 1.6 seconds to run inference on our documents.
Could you please confirm whether we can achieve better performance for this inference workload using an EC2 Inf1 instance?
Please let me know,
Regards, Karthik
Hi karthikgali,
The Inf1 instances outperform the g4 family with many models, spanning computer vision, NLP and speech use-cases. While we haven't specifically tried this model, our general experience with transformer based models is that Inf1 will provide around 30% higher throughput, and half the cost per inference, compared to G4.
I suggest you try this specific model on Inf1, and we are here to support in case you have additional questions.
Best regards, Gadi
Hi,
Thanks @AWSGH
I used the following code to compile the model. It compiled successfully, although with a lot of warnings:
import tensorflow
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Build tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distill-pegasus-cnn-16-4")
model = AutoModelForSequenceClassification.from_pretrained("sshleifer/distill-pegasus-cnn-16-4")
example_txt = [
    'this is a sentence in english that we want to summarize',
]
encode_example = tokenizer.encode_plus(example_txt[0],truncation=True, padding='longest', max_length=1024, pad_to_max_length=True, return_tensors='pt')
example_inputs = encode_example['input_ids'], encode_example['attention_mask'] #, encode_example['token_type_ids']
model_neuron = torch.neuron.trace(model, example_inputs, compiler_args=['-O2'])
model_neuron.save('pegasus_neuron.pt')
When I try to run inference on a sentence, I get an error.
Inference code:
from transformers import AutoTokenizer, AutoModelWithLMHead
import os
import torch
import torch_neuron
class DistillSummarizeLarge:
    def __init__(self):
        self.device = Config.device
        _cache = Config.mount
        self.tokenizer = AutoTokenizer.from_pretrained(
            "sshleifer/distill-pegasus-cnn-16-4", cache_dir=_cache)
        self.model = torch.jit.load('pegasus_neuron.pt')
        #self.model.eval()

    def __call__(self, text):
        batch = self.tokenizer.encode_plus(
            text, truncation=True, padding='longest', return_tensors="pt")
        batch_new = batch['input_ids'], batch['attention_mask']
        translated = self.model(*batch_new)
        tgt_text = self.tokenizer.batch_decode(
            translated, skip_special_tokens=True)
        return tgt_text
summarizer = DistillSummarizeLarge()
print(summarizer('''It indicates a way to close an interaction, or dismiss a notification. Good Subscriber Account active since Edit my Account Premium Articles Upgrade Membership Email Preferences My Subscription FAQs Logout DOW S&P 500 NASDAQ 100 Close icon Two crossed lines that form an 'X'. It indicates a way to close an interaction, or dismiss a notification. Streaming Software & Apps Smart Home Smartphones Laptops & Tablets Gaming Gadgets More Button Icon Circle with three vertical dots'''))
I am getting the below error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-6-e49367b7b011> in <module>
1 summarizer = DistillSummarizeLarge()
2
----> 3 print(summarizer('''It indicates a way to close an interaction, or dismiss a notification. Good Subscriber Account active since Edit my Account Premium Articles Upgrade Membership Email Preferences My Subscription FAQs Logout DOW S&P 500 NASDAQ 100 Close icon Two crossed lines that form an 'X'. It indicates a way to close an interaction, or dismiss a notification. Streaming Software & Apps Smart Home Smartphones Laptops & Tablets Gaming Gadgets More Button Icon Circle with three vertical dots'''))
<ipython-input-2-ac180eb64b6e> in __call__(self, text)
27
28 batch_new = batch['input_ids'], batch['attention_mask']
---> 29 translated = self.model(*batch_new)
30 tgt_text = self.tokenizer.batch_decode(
31 translated, skip_special_tokens=True)
~/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
548 result = self._slow_forward(*input, **kwargs)
549 else:
--> 550 result = self.forward(*input, **kwargs)
551 for hook in self._forward_hooks.values():
552 hook_result = hook(self, input, result)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/torch_neuron/convert.py", line 74, in forward
_58 = torch.slice(tensor6, 0, 0, 9223372036854775807, 1)
tensor7 = torch.slice(_58, 1, 1, 9223372036854775807, 1)
tensor8 = torch.view(tensor5, [13])
~~~~~~~~~~ <--- HERE
_59 = torch.expand(tensor8, [1, 13], implicit=True)
_60 = torch.copy_(tensor7, _59, False)
Traceback of TorchScript, original code (most recent call last):
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/resolve_function.py(49): func
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py(195): __call__
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py(85): run_op
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/graph.py(74): __call__
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py(106): forward
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py(534): _slow_forward
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py(548): __call__
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/jit/__init__.py(1027): trace_module
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch/jit/__init__.py(875): trace
/home/ubuntu/anaconda3/envs/aws_neuron_pytorch_p36/lib/python3.6/site-packages/torch_neuron/convert.py(87): trace
compile.py(24): <module>
RuntimeError: shape '[13]' is invalid for input of size 87
Could you please look into this and help me resolve it?
@AWSGH any update on this? Please let us know.
@karthikgali I have created a new github issue https://github.com/aws/aws-neuron-sdk/issues/182 to look into the infer issue that you reported. We are looking into it and will get back to you shortly. Updates will be posted directly on https://github.com/aws/aws-neuron-sdk/issues/182
@karthikgali I have posted an update on https://github.com/aws/aws-neuron-sdk/issues/182 for the specific issue that you reported. AWS Neuron compiles for fixed tensor sizes, while the model you are compiling is currently being fed inputs with varying tensor shapes. Sample code to address this is detailed on https://github.com/aws/aws-neuron-sdk/issues/182.
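For general illustration only (the authoritative sample code is on issue 182, and the variable names here are placeholders): padding the inference-time inputs to the same fixed length used during tracing keeps the tensor shapes identical to what the model was compiled for. The tokenizer arguments mirror the compile snippet above:

import torch
import torch_neuron
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distill-pegasus-cnn-16-4")
model = torch.jit.load('pegasus_neuron.pt')

text = "an example document to run through the compiled model"

# Pad/truncate to the same fixed length (1024) used in the compile step,
# so input_ids and attention_mask have the shapes the model was traced with
batch = tokenizer.encode_plus(text, truncation=True, max_length=1024,
                              pad_to_max_length=True, return_tensors='pt')
outputs = model(batch['input_ids'], batch['attention_mask'])

Note that this sketch only exercises the traced forward pass with matching shapes; generation-style decoding (generate()) is not captured by tracing and is a separate topic.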
I have tried the original tutorial, https://github.com/aws/aws-neuron-sdk/blob/master/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.ipynb: