NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
8.43k stars 953 forks

BERT model can't be converted #998

Open whk6688 opened 8 months ago

whk6688 commented 8 months ago

System Info

CUDA 12.2

Who can help?

No response

Information

Tasks

Reproduction

When I run the BERT example with the following command:

nohup python3 build.py --dtype=float16 --log_level=verbose > t2.log 2>&1 &

Expected behavior

The model should be converted to an ONNX model. Right now the output folder is empty.

actual behavior

[01/26/2024-18:54:41] [TRT] [V] After Myelin optimization: 1 layers
[01/26/2024-18:54:41] [TRT] [V] Applying ScaleNodes fusions.
[01/26/2024-18:54:41] [TRT] [V] After scale fusion: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After dupe layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After final dead-layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After tensor merging: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After vertical fusions: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After dupe layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [W] [RemoveDeadLayers] Input Tensor input_lengths is unused or used only at compile-time, but is not being removed.
[01/26/2024-18:54:41] [TRT] [V] After final dead-layer removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After tensor merging: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After slice removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] After concat removal: 1 layers
[01/26/2024-18:54:41] [TRT] [V] Trying to split Reshape and strided tensor
[01/26/2024-18:54:41] [TRT] [V] Graph optimization time: 0.0770116 seconds.
[01/26/2024-18:54:41] [TRT] [V] Building graph using backend strategy 2
[01/26/2024-18:54:41] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[01/26/2024-18:54:41] [TRT] [V] Constructing optimization profile number 0 [1/1].
[01/26/2024-18:54:41] [TRT] [V] Applying generic optimizations to the graph for inference.
[01/26/2024-18:54:42] [TRT] [V] Reserving memory for host IO tensors. Host: 0 bytes
[01/26/2024-18:54:42] [TRT] [V] =============== Computing costs for {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]}
[01/26/2024-18:54:42] [TRT] [V] *** Autotuning format combination: Int32(input_len,1), Int32(input_len,1) -> Float((* 1024 input_len),1024,1) ***
[01/26/2024-18:54:42] [TRT] [V] --------------- Timing Runner: {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023])
[01/26/2024-18:54:44] [TRT] [V] [MemUsageChange] Subgraph create: CPU +1415, GPU +1700, now: CPU 5966, GPU 8997 (MiB)
[01/26/2024-18:54:46] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [shape.cpp:verify_output_type:1274] Mismatched type for tensor 'BertModel/layers/0/attention/qkv/MATRIX_MULTIPLY_0_output_0', f32 vs. expected type:f16.
[01/26/2024-18:54:46] [TRT] [V] {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023]) profiling completed in 4.61707 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[01/26/2024-18:54:46] [TRT] [V] *** Autotuning format combination: Int32(input_len,1), Int32(input_len,1) -> Half((* 1024 input_len),1024,1) ***
[01/26/2024-18:54:46] [TRT] [V] --------------- Timing Runner: {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023])
[01/26/2024-18:54:46] [TRT] [V] [MemUsageChange] Subgraph create: CPU +38, GPU +0, now: CPU 4608, GPU 8868 (MiB)

additional notes

None

byshiue commented 8 months ago

Could you share the full log? I don't see the error in the log here. If the program crashes directly, building the engine might require more RAM.
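One way to confirm the RAM hypothesis is to watch free host memory while build.py runs and see whether it collapses toward zero just before the process dies (a sign of the OOM killer). A minimal, Linux-only sketch; the helper name and polling interval are illustrative, not part of TensorRT-LLM:

```python
import os
import time

def available_ram_gib():
    """Free host RAM in GiB via POSIX sysconf (Linux-only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")       # bytes per memory page
    avail_pages = os.sysconf("SC_AVPHYS_PAGES")  # pages currently available
    return page_size * avail_pages / 1024 ** 3

# Run this alongside `python3 build.py ...` in another terminal.
for _ in range(3):
    print(f"available RAM: {available_ram_gib():.2f} GiB")
    time.sleep(0.2)
```

If the engine build is killed without a Python traceback while this reading approaches zero, the system ran out of host memory rather than hitting a TensorRT error.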

whk6688 commented 8 months ago

I captured a screenshot; the GPU is full.

[screenshot: GPU memory usage]

whk6688 commented 8 months ago

There is no error; the log file ends with:

[01/26/2024-18:54:44] [TRT] [V] [MemUsageChange] Subgraph create: CPU +1415, GPU +1700, now: CPU 5966, GPU 8997 (MiB)
[01/26/2024-18:54:46] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception [shape.cpp:verify_output_type:1274] Mismatched type for tensor 'BertModel/layers/0/attention/qkv/MATRIX_MULTIPLY_0_output_0', f32 vs. expected type:f16.
[01/26/2024-18:54:46] [TRT] [V] {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023]) profiling completed in 4.61707 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[01/26/2024-18:54:46] [TRT] [V] *** Autotuning format combination: Int32(input_len,1), Int32(input_len,1) -> Half((* 1024 input_len),1024,1) ***
[01/26/2024-18:54:46] [TRT] [V] --------------- Timing Runner: {ForeignNode[BertModel/embedding/position_embedding/CONSTANT_0...BertModel/layers/23/post_layernorm/NORMALIZATION_0]} (Myelin[0x80000023])
[01/26/2024-18:54:46] [TRT] [V] [MemUsageChange] Subgraph create: CPU +38, GPU +0, now: CPU 4608, GPU 8868 (MiB)

Then the program exits.

whk6688 commented 8 months ago

I think you are right. Is there any way to reduce memory usage?

whk6688 commented 8 months ago

Oh, the last output is:

[screenshot: final log output]

byshiue commented 8 months ago

We don't have a way to reduce the RAM usage now.

Muhtasham commented 8 months ago

@whk6688 Try reducing max_batch_size in build.py. I set it to 128 and it fit into 4 GB of RAM.
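For reference, assuming the BERT example's build.py accepts a --max_batch_size flag alongside the flags used in the original command (flag name as commonly exposed by the repo's example build scripts; verify against your build.py), the rebuild would look like:

```shell
# Same command as in the report, with a smaller maximum batch size to
# shrink peak memory during engine building (128 reportedly fits in 4 GB).
nohup python3 build.py --dtype=float16 \
    --max_batch_size=128 \
    --log_level=verbose > t2.log 2>&1 &
```

A smaller maximum batch size shrinks the optimization profiles TensorRT must autotune, which lowers peak host and GPU memory during the build at the cost of a lower serving batch limit.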