taozhang9527 opened 12 months ago
Since a different TP size changes the computation order of the GEMMs, it is expected to observe different scores. You can run more iterations (like --max_ite 1000) to get a more stable result.
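For illustration, here is a minimal NumPy sketch (illustrative code only, not from TensorRT-LLM) of why the reduction order matters: splitting the same reduction across a different number of tensor-parallel shards changes the floating-point accumulation order, so the low bits of the result differ.

```python
# Minimal sketch: the same dot product reduced with 1, 2, or 4 "shards".
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16384).astype(np.float32)
w = rng.standard_normal(16384).astype(np.float32)

def sharded_dot(x, w, tp):
    # Each "rank" reduces its own slice, then the partial sums are added,
    # mimicking how TP splits a GEMM's reduction dimension.
    parts = [float(np.dot(xs, ws)) for xs, ws in zip(np.split(x, tp), np.split(w, tp))]
    return sum(parts)

for tp in (1, 2, 4):
    print(f"tp={tp}: {sharded_dot(x, w, tp)!r}")
# The values agree to several significant digits but not exactly; across a
# full forward pass such tiny differences can occasionally flip a generated
# token, which is why the ROUGE scores move a little with TP size.
```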
Thank you. This means that the original model accuracy can vary after build, which is surprising.
- In this example, the differences were observed with TP; what about pipeline parallelism?
- Do you suggest always re-evaluating the model after the build?
- summarize.py only verifies the summarization capability after the build. Does NVIDIA provide benchmark scripts to verify other capabilities, e.g., reasoning, analysis, etc.?
Testing ROUGE accuracy for the Llama-2-70b-chat-hf model with summarize.py, using FP8 quantization, for 2 GPUs and 4 GPUs respectively.
Quantization follows the instructions here.
Build commands: For 2 GPUs:
python examples/llama/build.py --model_dir Llama-2-70b-chat-hf \
    --quantized_fp8_model_path Llama-2-70b-chat-hf_quantized_fp8/llama_tp1_rank0.npz \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --output_dir examples/llama/out/70b/fp8_2gpu \
    --remove_input_padding \
    --enable_fp8 \
    --fp8_kv_cache \
    --world_size 2 \
    --tp_size 2
For 4 GPUs:
python examples/llama/build.py --model_dir Llama-2-70b-chat-hf \
    --quantized_fp8_model_path Llama-2-70b-chat-hf_quantized_fp8/llama_tp1_rank0.npz \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --output_dir examples/llama/out/70b/fp8_4gpu \
    --remove_input_padding \
    --enable_fp8 \
    --fp8_kv_cache \
    --world_size 4 \
    --tp_size 4
Benchmark with summarize.py: For 2 GPUs:
mpirun -n 2 --allow-run-as-root python3 examples/llama/summarize.py --test_trt_llm --hf_model_location Llama-2-70b-chat-hf --data_type fp16 --engine_dir examples/llama/out/70b/fp8_2gpu/
Results:
[11/14/2023-23:36:40] [TRT-LLM] [MPI_Rank 0] [I] TensorRT-LLM beam 0 result
[11/14/2023-23:36:40] [TRT-LLM] [MPI_Rank 0] [I] rouge1 : 28.41303469006552
[11/14/2023-23:36:40] [TRT-LLM] [MPI_Rank 0] [I] rouge2 : 9.459688427598037
[11/14/2023-23:36:40] [TRT-LLM] [MPI_Rank 0] [I] rougeL : 21.09760668757889
[11/14/2023-23:36:40] [TRT-LLM] [MPI_Rank 0] [I] rougeLsum : 23.66656895231521
For 4 GPUs:
mpirun -n 4 --allow-run-as-root python3 examples/llama/summarize.py --test_trt_llm --hf_model_location Llama-2-70b-chat-hf --data_type fp16 --engine_dir examples/llama/out/70b/fp8_4gpu/
Results:
[11/14/2023-23:20:24] [TRT-LLM] [MPI_Rank 0] [I] TensorRT-LLM beam 0 result
[11/14/2023-23:20:24] [TRT-LLM] [MPI_Rank 0] [I] rouge1 : 29.83943705196579
[11/14/2023-23:20:24] [TRT-LLM] [MPI_Rank 0] [I] rouge2 : 9.043187238064533
[11/14/2023-23:20:24] [TRT-LLM] [MPI_Rank 0] [I] rougeL : 22.147646131713454
[11/14/2023-23:20:24] [TRT-LLM] [MPI_Rank 0] [I] rougeLsum : 24.574244292240255
Questions: With the same model running on different numbers of GPUs, why are there obvious accuracy differences as shown above? Why would the hardware configuration affect the model accuracy?
these scores are over all the articles in test set or for single article only?
By default, the test runs 20 articles.
I tried the --max_ite option with the Llama 7b model in the 1-GPU scenario. It seems the script runs it in batch mode. Any value larger than 70 generates a CUDA out-of-memory error.
10 iterations
python3 examples/llama/summarize.py --test_trt_llm --hf_model_location Llama-2-7b-chat-hf --data_type fp16 --engine_dir examples/llama/out/7b/fp16_1gpu/ --max_ite 10
[12/02/2023-00:52:00] [TRT-LLM] [I] TensorRT-LLM (total latency: 19.241528749465942 sec)
[12/02/2023-00:52:00] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[12/02/2023-00:52:01] [TRT-LLM] [I] rouge1 : 31.51783972393539
[12/02/2023-00:52:01] [TRT-LLM] [I] rouge2 : 12.962085992138302
[12/02/2023-00:52:01] [TRT-LLM] [I] rougeL : 22.78973749166779
[12/02/2023-00:52:01] [TRT-LLM] [I] rougeLsum : 26.209055650525226
30 iterations
python3 examples/llama/summarize.py --test_trt_llm --hf_model_location Llama-2-7b-chat-hf --data_type fp16 --engine_dir examples/llama/out/7b/fp16_1gpu/ --max_ite 30
[12/02/2023-00:49:13] [TRT-LLM] [I] TensorRT-LLM (total latency: 57.63183331489563 sec)
[12/02/2023-00:49:13] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[12/02/2023-00:49:13] [TRT-LLM] [I] rouge1 : 28.73348811504518
[12/02/2023-00:49:13] [TRT-LLM] [I] rouge2 : 9.470286845031945
[12/02/2023-00:49:13] [TRT-LLM] [I] rougeL : 20.088190753161435
[12/02/2023-00:49:13] [TRT-LLM] [I] rougeLsum : 23.70223818328634
60 iterations
python3 examples/llama/summarize.py --test_trt_llm --hf_model_location Llama-2-7b-chat-hf --data_type fp16 --engine_dir examples/llama/out/7b/fp16_1gpu/ --max_ite 50
[12/02/2023-00:45:52] [TRT-LLM] [I] TensorRT-LLM (total latency: 115.86576294898987 sec)
[12/02/2023-00:45:52] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[12/02/2023-00:45:52] [TRT-LLM] [I] rouge1 : 28.65772650686712
[12/02/2023-00:45:52] [TRT-LLM] [I] rouge2 : 9.53579634159262
[12/02/2023-00:45:52] [TRT-LLM] [I] rougeL : 19.860040871328565
[12/02/2023-00:45:52] [TRT-LLM] [I] rougeLsum : 23.49220177494702
What is the minimum number of iterations recommended?
For the quantized model, running on 4 GPUs gives an accuracy of 0, while 1 GPU seems OK.
Using the Llama 7b GPTQ model as an example:
Running on 1 GPU with 60 iterations
python3 examples/llama/summarize.py --test_trt_llm --hf_model_location Llama-2-7b-chat-hf --data_type fp16 --engine_dir examples/llama/out/7b/gptq_1gpu/ --max_ite 60
[12/02/2023-01:01:15] [TRT-LLM] [I] TensorRT-LLM (total latency: 41.35520267486572 sec)
[12/02/2023-01:01:15] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[12/02/2023-01:01:15] [TRT-LLM] [I] rouge1 : 26.726988823529794
[12/02/2023-01:01:15] [TRT-LLM] [I] rouge2 : 8.050453201584101
[12/02/2023-01:01:15] [TRT-LLM] [I] rougeL : 19.21217015937522
[12/02/2023-01:01:15] [TRT-LLM] [I] rougeLsum : 22.25121143267094
Running on 4 GPUs with 60 iterations
mpirun -n 4 --allow-run-as-root python3 examples/llama/summarize.py --test_trt_llm --hf_model_location Llama-2-7b-chat-hf --data_type fp16 --engine_dir examples/llama/out/7b/gptq_4gpu/ --max_ite 60
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] TensorRT-LLM (total latency: 27.26634407043457 sec)
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] TensorRT-LLM beam 0 result
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] rouge1 : 0.0
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] rouge2 : 0.0
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] rougeL : 0.0
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] rougeLsum : 0.0
You could add
from tensorrt_llm.profiler import print_memory_usage
print_memory_usage()
to print the memory usage.
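For example, a rough sketch of one possible placement; the loop and run_batch() below are placeholders standing in for the real evaluation loop in examples/summarize.py, not the actual script:

```python
# Rough sketch only: placeholder loop, not the real summarize.py code.
from tensorrt_llm.profiler import print_memory_usage

def run_batch(ite):
    """Placeholder for one iteration (tokenize a batch, run the engine)."""
    pass

max_ite = 3  # would come from --max_ite in the real script
for ite in range(max_ite):
    run_batch(ite)
    print_memory_usage()  # prints allocated host/device memory, like the [MemUsage] lines in the log
```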
I gave it a try and did not find an issue.
python ../summarize.py --engine_dir "/tmp/new_13b/trt_engines/fp16/1-gpu/" --test_trt_llm --max_ite 70 --tokenizer_dir llama-v2-13b-hf/
[12/05/2023-08:39:02] [TRT-LLM] [I] Load tokenizer takes: 0.07715773582458496 sec
[12/05/2023-08:39:13] [TRT] [I] Loaded engine size: 24831 MiB
[12/05/2023-08:39:15] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +64, now: CPU 25324, GPU 25421 (MiB)
[12/05/2023-08:39:15] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +3, GPU +72, now: CPU 25327, GPU 25493 (MiB)
[12/05/2023-08:39:15] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +24825, now: CPU 0, GPU 24825 (MiB)
[12/05/2023-08:39:15] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +64, now: CPU 25366, GPU 36321 (MiB)
[12/05/2023-08:39:15] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +64, now: CPU 25366, GPU 36385 (MiB)
[12/05/2023-08:39:15] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 24825 (MiB)
[12/05/2023-08:39:15] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +64, now: CPU 25409, GPU 36495 (MiB)
[12/05/2023-08:39:16] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +72, now: CPU 25410, GPU 36567 (MiB)
[12/05/2023-08:39:16] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 24825 (MiB)
[12/05/2023-08:39:16] [TRT-LLM] [I] Load engine takes: 12.31963038444519 sec
[12/05/2023-08:39:19] [TRT-LLM] [I] ---------------------------------------------------------
[12/05/2023-08:39:19] [TRT-LLM] [I] TensorRT-LLM Generated :
[12/05/2023-08:39:19] [TRT-LLM] [I] Input : ['(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in theater and in Hollywood, Best didn\'t become famous until 1979, when "The Dukes of Hazzard\'s" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best\'s Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catchphrases such as "cuff \'em and stuff \'em!" upon making an arrest. Among the most popular shows on TV in the early \'80s, "The Dukes of Hazzard" ran until 1985 and spawned TV movies, an animated series and video games. Several of Best\'s "Hazzard" co-stars paid tribute to the late actor on social media. "I laughed and learned more from Jimmie in one hour than from anyone else in a whole year," co-star John Schneider, who played Bo Duke, said on Twitter. "Give Uncle Jesse my love when you see him dear friend." "Jimmy Best was the most constantly creative person I have ever known," said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. "Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his life\'s many passions." Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. Best served in the Army during World War II before launching his acting career. In the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as "The Twilight Zone," "Bonanza," "The Andy Griffith Show" and "Gunsmoke." He later appeared in a handful of Burt Reynolds\' movies, including "Hooper" and "The End." But Best will always be best known for his "Hazzard" role, which lives on in reruns. "Jimmie was my teacher, mentor, close friend and collaborator for 26 years," Latshaw said. "I directed two of his feature films, including the recent \'Return of the Killer Shrews,\' a sequel he co-wrote and was quite proud of as he had made the first one more than 50 years earlier." People we\'ve lost in 2015 . CNN\'s Stella Chan contributed to this story.']
[12/05/2023-08:39:19] [TRT-LLM] [I]
Reference : ['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .']
[12/05/2023-08:39:19] [TRT-LLM] [I]
Output : [['James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88.\n(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief']]
[12/05/2023-08:39:19] [TRT-LLM] [I] ---------------------------------------------------------
[12/05/2023-08:39:21] [TRT-LLM] [I] [MemUsage] Allocated Memory: Host 1.6782 (GiB) Device 36.6030 (GiB)
[12/05/2023-08:39:38] [TRT-LLM] [I] [MemUsage] Allocated Memory: Host 1.6813 (GiB) Device 37.3842 (GiB)
[12/05/2023-08:39:56] [TRT-LLM] [I] [MemUsage] Allocated Memory: Host 1.6839 (GiB) Device 37.3842 (GiB)
[12/05/2023-08:40:14] [TRT-LLM] [I] [MemUsage] Allocated Memory: Host 1.6849 (GiB) Device 37.3842 (GiB)
[12/05/2023-08:40:30] [TRT-LLM] [I] [MemUsage] Allocated Memory: Host 1.6897 (GiB) Device 37.3842 (GiB)
[12/05/2023-08:40:50] [TRT-LLM] [I] [MemUsage] Allocated Memory: Host 1.6916 (GiB) Device 37.3842 (GiB)
[12/05/2023-08:41:11] [TRT-LLM] [I] [MemUsage] Allocated Memory: Host 1.6930 (GiB) Device 37.3842 (GiB)
[12/05/2023-08:41:32] [TRT-LLM] [I] TensorRT-LLM (total latency: 130.97906184196472 sec)
[12/05/2023-08:41:32] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[12/05/2023-08:41:32] [TRT-LLM] [I] rouge1 : 20.9531536968386
[12/05/2023-08:41:32] [TRT-LLM] [I] rouge2 : 6.1729628460231645
[12/05/2023-08:41:32] [TRT-LLM] [I] rougeL : 15.237700671631076
[12/05/2023-08:41:32] [TRT-LLM] [I] rougeLsum : 18.311383327456443
Are the rouge1, rouge2, rougeL, and rougeLsum scores the higher the better?
In most cases, higher scores are better. But ideally we hope the scores are close to our reference, not higher than it.
Thanks for the reply. By the way, does the reference here refer to the HF results?
It depends on what reference model you use. In most of our examples, we use the HF model.
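For reference, here is a minimal sketch of how such numbers can be reproduced with the rouge_score package; the strings are placeholders (taken from the sample above), and treating fmeasure scaled by 100 as the reported value is an assumption about summarize.py, not something confirmed here.

```python
# Minimal sketch with the rouge_score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True
)

reference = ('James Best, who played the sheriff on "The Dukes of Hazzard," '
             "died Monday at 88.")
prediction = ("James Best, best known for playing sheriff Rosco P. Coltrane, "
              "died Monday after a brief illness. He was 88.")

scores = scorer.score(reference, prediction)
for name, s in scores.items():
    # Assumption: summarize.py reports the f-measure scaled by 100 and
    # averaged over all evaluated articles.
    print(name, round(s.fmeasure * 100, 2))
```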
@byshiue
Could you comment on the result I reported earlier, where multi-GPU gives 0 rouge accuracy, as shown below?
Running on 4 GPUs with 60 iterations
mpirun -n 4 --allow-run-as-root python3 examples/llama/summarize.py --test_trt_llm --hf_model_location Llama-2-7b-chat-hf --data_type fp16 --engine_dir examples/llama/out/7b/gptq_4gpu/ --max_ite 60
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] TensorRT-LLM (total latency: 27.26634407043457 sec)
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] TensorRT-LLM beam 0 result
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] rouge1 : 0.0
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] rouge2 : 0.0
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] rougeL : 0.0
[12/02/2023-01:04:27] [TRT-LLM] [MPI_Rank 0] [I] rougeLsum : 0.0
It seems that the summarize.py script does not work anymore in the latest 0.6.1 release. Could you take a look at #635?
The old version of summarize.py is at /TensorRT-LLM/examples/summarize.py.
Thank you. The old summarize.py works fine. What is the usage of summarize_long.py under the llama folder?
@taozhang9527 And for the TP4-GPTQ-LLAMA2-7b accuracy problem, please let me know your instructions for GPTQ quantization and engine building. It works fine on my side with TP2.
Let me run it again, as I am on 0.6.1 now.
Tried GPTQ with TP=4 for Llama2-70b; summarize.py seems to work fine.
[12/12/2023-20:40:41] [TRT-LLM] [MPI_Rank 0] [I] ---------------------------------------------------------
[12/12/2023-20:41:54] [TRT-LLM] [MPI_Rank 0] [I] TensorRT-LLM (total latency: 71.13045263290405 sec)
[12/12/2023-20:41:54] [TRT-LLM] [MPI_Rank 0] [I] TensorRT-LLM beam 0 result
[12/12/2023-20:41:54] [TRT-LLM] [MPI_Rank 0] [I] rouge1 : 13.626381781752045
[12/12/2023-20:41:54] [TRT-LLM] [MPI_Rank 0] [I] rouge2 : 2.714765540315122
[12/12/2023-20:41:54] [TRT-LLM] [MPI_Rank 0] [I] rougeL : 10.152512207095684
[12/12/2023-20:41:54] [TRT-LLM] [MPI_Rank 0] [I] rougeLsum : 12.104668022927967
For Llama2-7b, I was able to build engines for TP=1 and TP=2, but TP=4 and TP=8 fail with the 0.6.1 release code, so I am not able to test. The command and error info are listed in #648.