ELS-RD / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
https://els-rd.github.io/transformer-deploy/
Apache License 2.0
1.64k stars 150 forks

GPT-J 6B model #146

Closed timofeev1995 closed 1 year ago

timofeev1995 commented 1 year ago

Hello! Thank you for your framework! I have a question about converting and serving very large (6B+) models with it. I tried the tips for large models (the --fast option, etc.), but I get CUDA OOM even on an NVIDIA A100 40GB card. Is that expected behaviour? Are there any tips for converting models of that size? Thank you in advance.

pommedeterresautee commented 1 year ago

Sorry for the delay. Do you use ONNX Runtime or TensorRT?

CrazyPython commented 1 year ago

When converting the model to ONNX, this happens:

│ /home/james/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1 │
│ 182 in _slow_forward                                                         │
│                                                                              │
│   1179 │   │   │   else:                                                     │
│   1180 │   │   │   │   recording_scopes = False                              │
│   1181 │   │   try:                                                          │
│ ❱ 1182 │   │   │   result = self.forward(*input, **kwargs)                   │
│   1183 │   │   finally:                                                      │
│   1184 │   │   │   if recording_scopes:                                      │
│   1185 │   │   │   │   tracing_state.pop_scope()                             │
│                                                                              │
│ /home/james/.local/lib/python3.10/site-packages/transformers/models/gptj/mod │
│ eling_gptj.py:589 in forward                                                 │
│                                                                              │
│    586 │   │   │   past_length = 0                                           │
│    587 │   │   │   past_key_values = tuple([None] * len(self.h))             │
│    588 │   │   else:                                                         │
│ ❱  589 │   │   │   past_length = past_key_values[0][0].size(-2)              │
│    590 │   │                                                                 │
│    591 │   │   if position_ids is None:                                      │
│    592 │   │   │   position_ids = torch.arange(past_length, input_shape[-1]  │
╰──────────────────────────────────────────────────────────────────────────────╯
IndexError: Dimension specified as -2 but tensor has no dimensions
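For context, the IndexError above is what PyTorch raises when you call .size(-2) on a 0-dimensional tensor. The likely failure mode (an assumption, not confirmed in this thread) is that during ONNX export tracing, past_key_values arrives as dimensionless placeholder tensors instead of None, so GPT-J takes the else branch and runs past_key_values[0][0].size(-2) on a tensor with no dimensions. A stdlib-only sketch (the size helper below is hypothetical, mimicking torch.Tensor.size's negative-index handling) reproduces the logic:

```python
def size(shape, dim):
    """Hypothetical stand-in for torch.Tensor.size(dim), stdlib only.

    Mimics PyTorch's behaviour: negative dims index from the end, and a
    0-dimensional (scalar) tensor raises IndexError for any dim argument.
    """
    ndim = len(shape)
    if ndim == 0:
        raise IndexError(
            f"Dimension specified as {dim} but tensor has no dimensions"
        )
    if not -ndim <= dim < ndim:
        raise IndexError(
            f"Dimension out of range (expected [{-ndim}, {ndim - 1}], got {dim})"
        )
    return shape[dim]


# A real GPT-J past key tensor has shape (batch, heads, seq_len, head_dim),
# so size(..., -2) yields past_length as the model's forward() expects:
assert size((1, 16, 128, 256), -2) == 128

# A 0-d placeholder, as tracing may pass in instead of None, reproduces
# the exact error from the traceback:
try:
    size((), -2)
except IndexError as e:
    print(e)  # Dimension specified as -2 but tensor has no dimensions
```

If that is what is happening, the fix belongs in the export wrapper (pass None, or properly shaped empty tensors, for the past state) rather than in the model code.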

When converting the model to TensorRT, I get the same error.

I tried --seq-len 1 128 128, 1 128 2047, 1 2048 2048, and 1 2047 2047, on both ONNX and TensorRT; the error is always the same. I tested on an A100 and on a CPU machine with 128 GB of RAM.