NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

How to reproduce the results of mlperf v4.0 llama2 70b sparse? #44

Open DehuaTang opened 3 months ago

DehuaTang commented 3 months ago

Outstanding work! Thanks for the effort you guys put in! Is it really possible to achieve 99.9% accuracy without fine-tuning for LLaMA-2-70B-Chat on the MLPerf task with 2:4 sparsity? I reproduced and tested it using ModelOpt and found that it only achieves 98% accuracy in FP16. Can you give me some suggestions for reproducing this work? For example, which hyperparameters need to be adjusted? Is FP8 fine-tuning necessary to reach 99.9% accuracy?

kaix90 commented 3 months ago

Thank you for your interest. It's feasible to achieve 99.9 ROUGE scores without fine-tuning for the 2:4 sparsified LLaMA-70B-Chat model. We utilized the ModelOpt sparsity package to accomplish this. The key difference may lie in the calibration dataset. We randomly selected a subset from the Open-Orca dataset, ensuring that test samples from MLPERF were excluded.
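For readers unfamiliar with the pattern: 2:4 structured sparsity keeps the two largest-magnitude weights in every contiguous group of four and zeros the rest, which is what NVIDIA sparse tensor cores accelerate. A minimal NumPy sketch of magnitude-based 2:4 pruning (an illustration of the pattern only, not the ModelOpt sparsity implementation, which uses calibration-aware methods):

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero out the 2 smallest-magnitude values in every group of 4.

    Assumes the flattened weight count is a multiple of 4, as is the
    case for typical LLM linear-layer dimensions.
    """
    w = weights.reshape(-1, 4).copy()
    # Indices of the 2 smallest |w| in each group of 4.
    idx = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, idx, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.array([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.01, 0.6])
print(prune_2_4(w))  # at most 2 nonzeros survive per group of 4
```

Calibration-aware methods (e.g. SparseGPT-style) choose which two weights to drop using activation statistics from a calibration set rather than raw magnitude, which is why the choice of calibration data (here, a subset of Open-Orca) matters for the final ROUGE scores.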

DehuaTang commented 2 months ago

That's amazing! None of the 2:4 sparsity papers I have seen reach 99.9% accuracy. Do you have any plans to release the full code? It would help reintroduce the industry to the impact of sparsity on LLMs.