ELS-RD / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
https://els-rd.github.io/transformer-deploy/
Apache License 2.0

Optimizations for T0 #102

Closed. michaelroyzen closed this issue 2 years ago

michaelroyzen commented 2 years ago

I'm trying to replicate the T5 ONNX optimization notebook (the latest version, on the feat/t5_3b branch), but for T0_3B (which is itself a derivative of T5, with a slightly different config and no tie_word_embeddings).

I installed ONNX runtime from source as described in the notebook.

The only changes I made to the notebook are replacing "t5-3b" with "bigscience/T0_3B" and commenting out `out_dec["last_hidden_state"] = out_dec["last_hidden_state"] * (pytorch_model.model_dim**-0.5)` in the ExportT5 class, since T0 does not tie its word embeddings.
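For illustration, here is a minimal sketch of the kind of wrapper change described above (the `ExportT5` name and the rescaling line come from the notebook; the rest of the wrapper is illustrative, not the exact notebook code):

```python
import torch

class ExportT5(torch.nn.Module):
    """Illustrative decoder wrapper in the spirit of the notebook's ExportT5."""

    def __init__(self, decoder: torch.nn.Module, lm_head: torch.nn.Module, model_dim: int):
        super().__init__()
        self.decoder = decoder
        self.lm_head = lm_head
        self.model_dim = model_dim

    def forward(self, input_ids: torch.Tensor, encoder_hidden_states: torch.Tensor):
        out_dec = self.decoder(input_ids=input_ids, encoder_hidden_states=encoder_hidden_states)
        # Rescaling is only correct when word embeddings are tied (as in t5-3b);
        # T0 does not tie them, so the line is disabled:
        # out_dec["last_hidden_state"] = out_dec["last_hidden_state"] * (self.model_dim ** -0.5)
        return self.lm_head(out_dec["last_hidden_state"])
```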

However, the notebook fails on `dec_if_ort_model = create_model_for_provider(dec_if_model_path, "CUDAExecutionProvider", log_severity=3)` with the error: `Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from ./test-dec-if/model.onnx failed:This is an invalid model. Error: the graph is not acyclic.`

Shouldn't T0 work because it is essentially T5? Your help would be greatly appreciated @pommedeterresautee. Thanks!

pommedeterresautee commented 2 years ago

Hi @michaelroyzen ,

I ran it on my side and it worked perfectly (with the same modifications as yours).

The closest model to T0 is T5 v1.1, as it uses a gated GELU instead of the plain ReLU of T5 in the feed-forward step. I have also tried with google/t5-v1_1-small for a quick test, and it also worked perfectly.

Can you retry with the small T5 v1.1?

FWIW I have onnx 1.11.0 installed (onnx lib is used for the graph manipulation).

Btw, one fun thing is that the PyTorch-generated text is garbage (random subtokens) while the ONNX output is fine. It puzzled me until I understood that the PyTorch model doesn't support the .half() call, meaning there are probably weights outside the FP16 range.

michaelroyzen commented 2 years ago

Thanks for getting back to me so quickly @pommedeterresautee. Are you using onnx 1.11.0 or onnxruntime 1.11.0? I've been building onnxruntime from source using the v1.12 commit hash provided in the notebook.

And yeah, calling .half() on PyTorch T5/T0 models breaks them because they were trained in bf16.
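As an aside, a quick way to check that hypothesis (illustrative snippet, not from the notebook; `bigscience/T0_3B` is just the model discussed above):

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# FP16 can only represent magnitudes up to ~65504; bf16-trained weights
# (or activations) above that overflow to inf after a naive .half() call.
fp16_max = torch.finfo(torch.float16).max
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B")

for name, param in model.named_parameters():
    max_abs = param.abs().max().item()
    if max_abs > fp16_max:
        print(f"{name}: max |w| = {max_abs:.1f} overflows FP16")
```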

pommedeterresautee commented 2 years ago

I really meant the onnx library, not onnxruntime. It's the one used to manipulate the computation graph, and it may be key in the merge process.
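For what it's worth, a small sanity check with the onnx library itself (illustrative; the path is the one from the error above) should reproduce the cycle error outside of onnxruntime:

```python
import onnx

print("onnx version:", onnx.__version__)

# Load the merged decoder and run the graph checker; a cyclic/invalid graph
# should be reported here too, independently of onnxruntime.
# (For models over 2 GB with external data, pass the path string to check_model.)
model = onnx.load("./test-dec-if/model.onnx")
onnx.checker.check_model(model)
```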

michaelroyzen commented 2 years ago

Gotcha, we do use onnx 1.11.0. Just tried with google/t5-v1_1-small and got `InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from ./test-dec-if/model.onnx failed:This is an invalid model.` on `dec_if_ort_model = create_model_for_provider(dec_if_model_path, "CUDAExecutionProvider", log_severity=3)`.

One thing to note is that we are using the nvcr.io/nvidia/pytorch:22.04-py3 Docker container with CUDA 11.6 to build onnxruntime. We tried building onnxruntime with nvcr.io/nvidia/pytorch:21.07-py3, which has CUDA 11.4, but then we get import errors when trying to use it in the notebook.

pommedeterresautee commented 2 years ago

Would it be possible to set log severity to 0 and share logs?
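That is, something like the following, assuming the same `create_model_for_provider` helper used earlier in the notebook (log severity 0 is onnxruntime's VERBOSE level):

```python
# Recreate the session with verbose logging so the offending node shows up in the logs.
dec_if_ort_model = create_model_for_provider(
    dec_if_model_path, "CUDAExecutionProvider", log_severity=0
)
```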

I did my test locally on Ubuntu 22.04, CUDA 11.7, with basically every Nvidia dependency up to date. Instinctively, I don't think it's related to CUDA drivers, etc.; it's something related to the graph.

If onnx is not the cause, maybe PyTorch? I would test with the NGC 22.02 container, which seems to be the last one built with PyTorch 1.11 (22.04 is based on a 1.12 nightly): https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_22-04.html#rel_22-04

michaelroyzen commented 2 years ago

I actually got it to work just now (the export to ONNX succeeded) using CUDA 11.4 from the 21.08 image and then manually bumping onnx to 1.11.0 and PyTorch to 1.11. Thanks!

Now I'm dealing with CUDA OOM when trying to run the benchmarks (hardware: A10 GPU with 24 GB of VRAM). It seems that loading

```python
enc_fp16_onnx = create_model_for_provider(encoder_fp16_model_path, "CUDAExecutionProvider", log_severity=3)
enc_fp16_onnx_binding: IOBinding = enc_fp16_onnx.io_binding()
dec_onnx = create_model_for_provider(dec_if_fp16_model_path, "CUDAExecutionProvider", log_severity=3)
dec_onnx_binding: IOBinding = dec_onnx.io_binding()
```

takes up all of the GPU memory.
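One thing that might help narrow this down (illustrative sketch, not from the notebook; the paths are the ones defined earlier in it): the CUDA execution provider accepts provider options that cap its memory arena per session, e.g.

```python
import onnxruntime as ort

# Cap the CUDA memory arena so the encoder and decoder sessions don't each
# pre-allocate most of the 24 GB up front; the values here are illustrative.
cuda_provider = (
    "CUDAExecutionProvider",
    {
        "gpu_mem_limit": 8 * 1024 ** 3,               # 8 GiB per session
        "arena_extend_strategy": "kSameAsRequested",  # grow only when needed
    },
)
enc_fp16_onnx = ort.InferenceSession(encoder_fp16_model_path, providers=[cuda_provider])
```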

pommedeterresautee commented 2 years ago

The issue is with the past states (the key/value cache), which of course become bigger and bigger with model size + beam search + long sequence length.

With the code on the repo, the best you can do is 128 tokens with beam search 2 on 24 GB of VRAM (at least on an RTX 3090).
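As a rough illustration of how the past states scale (back-of-the-envelope only, assuming the published t5-3b config values num_layers=24, num_heads=32, d_kv=128; this counts only the cache tensors themselves, not the weights, the duplicated branches of the merged decoder, or onnxruntime's arena over-allocation):

```python
# Rough size of FP16 past key/values for a t5-3b-like decoder.
num_layers, num_heads, d_kv, fp16_bytes = 24, 32, 128, 2
beam, dec_len, enc_len = 2, 128, 128  # beam width and sequence lengths

# K and V for self-attention (over generated tokens) and cross-attention
# (over the encoder sequence), per decoder layer and per beam.
self_attn = 2 * num_layers * beam * num_heads * dec_len * d_kv * fp16_bytes
cross_attn = 2 * num_layers * beam * num_heads * enc_len * d_kv * fp16_bytes
print(f"past states ≈ {(self_attn + cross_attn) / 2**20:.0f} MiB per input sequence")
# Doubling the beam width (or the sequence length) doubles this footprint.
```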

We are working on reducing the memory footprint; right now, with a fairly simple modification of the T5 code in transformers (which we plan to apply through monkey patching), we are able to double the max beam width (or probably the sequence length, not tried :-)).

We think there is a way to reduce RAM consumption much further for large models by leveraging EL-Attention: https://arxiv.org/abs/2105.04779

Some code has been open sourced (fastseq: https://github.com/microsoft/fastseq) and we have been able to verify the benchmarks of the paper (including on T5). Unfortunately, the open-source code is written in such a way that it needs some refactoring to be used IRL (and/or on recent versions of the transformers lib).

Is this something you would be interested in working on? If yes, we could share how we think the code can be adapted.

michaelroyzen commented 2 years ago

Is the ONNX model ~2x the memory size of the PyTorch one because of the If node that merges the decoders with and without cache? If so, I'd be interested in discarding the version without cache to save that memory.

Fastseq looks interesting; it'd be great if you could share how you think it could be adapted.

pommedeterresautee commented 2 years ago

Hi, it seems that the official PyTorch 1.12 release breaks the GPT-2 unit tests and probably T5 as well; we will dig into it. Regarding fastseq, after analyzing the few thousand lines of code in the repo, it appears that the modifications needed to get most of the perf are quite light. For now, our approach is to monkey patch the source code; for that, we compare the fastseq source code with transformers version 4.12.5.

To test the approach we have isolated a small (in LoC) optimization of T5 responsible for 50% of the whole speed-up. It's in #103.
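For readers unfamiliar with the approach, a minimal illustration of the monkey-patching pattern described here (purely illustrative; this is not the actual optimization from #103):

```python
import transformers.models.t5.modeling_t5 as modeling_t5

# Keep a handle to the original method, then swap in an optimized version
# at import time; the patched function shown here just delegates back.
_original_forward = modeling_t5.T5Attention.forward

def patched_forward(self, *args, **kwargs):
    # ... an optimized implementation would go here ...
    return _original_forward(self, *args, **kwargs)

modeling_t5.T5Attention.forward = patched_forward
```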

The next step is to apply the same approach to the cache-deduplication part related to beam search. Let us know if you need more info.

michaelroyzen commented 2 years ago

Running the new t5_bf16 notebook with T0 results in this error when exporting the encoder:

Using onnx 1.11, onnxruntime 1.12 compiled as per your instructions here, CUDA 11.4, PyTorch 1.10

```
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_632/3381225355.py in <module>
----> 1 convert_to_onnx(
      2     model_pytorch=pytorch_model.encoder,
      3     output_path=encoder_model_path,
      4     inputs_pytorch={"input_ids": input_ids},
      5     var_output_seq=True,

/opt/conda/lib/python3.8/site-packages/transformer_deploy/backends/pytorch_utils.py in convert_to_onnx(model_pytorch, output_path, inputs_pytorch, quantization, var_output_seq, output_names)
    151     input_names = list(inputs_pytorch.keys())
    152     with torch.no_grad():
--> 153         torch.onnx.export(
    154             model_pytorch,  # model to optimize
    155             args=tuple(inputs_pytorch.values()),  # tuple of multiple inputs

/opt/conda/lib/python3.8/site-packages/torch/onnx/__init__.py in export(model, args, f, export_params, verbose, training, input_names, output_names, aten, operator_export_type, opset_version, _retain_param_name, do_constant_folding, example_outputs, strip_doc_string, dynamic_axes, keep_initializers_as_inputs, custom_opsets, enable_onnx_checker, use_external_data_format)
--> 305     return utils.export(model, args, f, export_params, verbose, training,

/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py in export(model, args, f, export_params, verbose, training, input_names, output_names, aten, operator_export_type, opset_version, _retain_param_name, do_constant_folding, example_outputs, strip_doc_string, dynamic_axes, keep_initializers_as_inputs, custom_opsets, enable_onnx_checker, use_external_data_format)
---> 87     _export(model, args, f, export_params, verbose, training, input_names, output_names,

/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py in _export(model, args, f, export_params, verbose, training, input_names, output_names, operator_export_type, export_type, example_outputs, opset_version, _retain_param_name, do_constant_folding, strip_doc_string, dynamic_axes, keep_initializers_as_inputs, fixed_batch_size, custom_opsets, add_node_names, enable_onnx_checker, use_external_data_format, onnx_shape_inference)
--> 692         _model_to_graph(model, args, verbose, input_names,

/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py in _model_to_graph(model, args, verbose, input_names, output_names, operator_export_type, example_outputs, _retain_param_name, do_constant_folding, _disable_torch_constant_prop, fixed_batch_size, training, dynamic_axes)
--> 470     graph = _optimize_graph(graph, operator_export_type,

/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py in _optimize_graph(graph, operator_export_type, _disable_torch_constant_prop, fixed_batch_size, params_dict, dynamic_axes, input_names, module)
--> 198     graph = torch._C._jit_pass_onnx(graph, operator_export_type)

/opt/conda/lib/python3.8/site-packages/torch/onnx/__init__.py in _run_symbolic_function(*args, **kwargs)
--> 348     return utils._run_symbolic_function(*args, **kwargs)

/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py in _run_symbolic_function(g, block, n, inputs, env, operator_export_type)
--> 997     return symbolic_fn(g, *inputs, **attrs)

/opt/conda/lib/python3.8/site-packages/torch/onnx/symbolic_helper.py in wrapper(g, *args, **kwargs)
--> 172     return fn(g, *args, **kwargs)

/opt/conda/lib/python3.8/site-packages/torch/onnx/symbolic_opset9.py in embedding(g, weight, indices, padding_idx, scale_grad_by_freq, sparse)
--> 487     return g.op("Gather", weight, indices)

/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py in _graph_op(g, opname, *raw_args, **kwargs)
--> 893     torch._C._jit_pass_onnx_node_shape_type_inference(n, _params_dict, opset_version)

RuntimeError: unexpected tensor scalar type
```