NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Feature Request: "Model Zoo" for quantization #1591

Open atyshka opened 1 month ago

atyshka commented 1 month ago

TensorRT-LLM has great potential for letting people run larger models efficiently on limited hardware. Unfortunately, the current quantization workflow itself requires significant computational resources: an int8/FP8 quant of a 70B model would easily fit on 2x RTX 6000s, but performing that quantization on the same hardware is impossible, since the recommended configuration for calibration is 4x A100/H100.
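For context on why the requirement is so steep: calibration runs sample data through the full-precision model to collect activation statistics, so the whole FP16 model has to be resident in GPU memory. A minimal sketch of that flow, assuming the modelopt-style PTQ API (the model id, config name, and calibration set here are illustrative, not a recommendation):

```python
# Sketch of post-training quantization with calibration (names approximate).
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # illustrative target
# The full FP16 model must fit in GPU memory -- roughly 140 GB for 70B.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

calib_texts = ["TensorRT-LLM builds optimized engines for LLMs."] * 16

def forward_loop(model):
    # Run calibration samples through the model so the quantizer can
    # observe activation ranges (real calibration uses ~512 samples).
    with torch.no_grad():
        for text in calib_texts:
            inputs = tokenizer(text, return_tensors="pt").to("cuda")
            model(**inputs)

# FP8 quantization; the config name is an assumption.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```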

The only option for those of us without this hardware is renting large instances on AWS or similar, but it seems wasteful for every user to do that individually. Could there be a "model zoo" for the repo where these quants are stored, similar to how quants are published on Hugging Face?

I understand there may be licensing concerns with hosting weights for models like Llama. Instead, could we host pre-computed activation scales that can be used to compute the quantized weights without excessive GPU memory? I don't see any licensing issue with such a solution.
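To illustrate what publishing scales would enable, here is a hedged sketch of SmoothQuant-style per-channel weight quantization (the function name, alpha, and layout are hypothetical, not an existing TRT-LLM format): given pre-computed per-channel activation maxima, the weight side needs only a few tensor ops that run comfortably on CPU.

```python
# Hypothetical sketch: quantize one linear layer's weights on CPU using
# pre-computed per-channel activation scales (SmoothQuant-style math).
import torch

def smoothquant_int8_weights(weight, act_scale, alpha=0.5):
    """weight: [out, in] float; act_scale: [in] per-channel max |activation|."""
    # Migrate quantization difficulty from activations into the weights.
    w_absmax = weight.abs().amax(dim=0).clamp(min=1e-5)
    smooth = (act_scale.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)
    w = weight * smooth  # scale each input channel by its smoothing factor
    # Symmetric per-output-channel int8 quantization of the smoothed weights.
    q_scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    w_int8 = torch.round(w / q_scale).clamp(-128, 127).to(torch.int8)
    return w_int8, q_scale, smooth

# act_scale would come from the published calibration artifact; it is
# random here only to make the sketch runnable.
weight = torch.randn(4096, 4096)
act_scale = torch.rand(4096) * 4 + 0.1
w_int8, q_scale, smooth = smoothquant_int8_weights(weight, act_scale)
```

Nothing in that sketch touches a GPU, which is the point: the expensive part (observing activations) would already be baked into the published scales.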

byshiue commented 1 month ago

Thank you for the request. We hope such tasks can be covered by the community so that we can focus on supporting important models and features.

atyshka commented 1 month ago

Hi @byshiue, I agree this would be a nice open-source collaborative effort. Unfortunately, the community is limited here: quantization for most formats, like FP8 and per-channel SmoothQuant, is now done in the closed-source NVIDIA modelopt tool, so there's no way I could make a PR to output activation scales. A limited subset of quantization is still open source as part of TRT-LLM, but I wonder whether that too will be migrated to modelopt in the near future.

If you think it's more appropriate, I could re-file this issue on the modelopt examples repo.

matichon-vultureprime commented 1 month ago

I could upload quantized weights to Hugging Face, like TheBloke does. I might start this coming weekend.
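For anyone wanting to consume such uploads, it would be a standard hub download followed by a local engine build. A sketch (the repo id below is a placeholder, not a real upload):

```python
# Sketch: fetch a community-published quantized TRT-LLM checkpoint from
# the Hugging Face hub. The repo id is a placeholder.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(repo_id="someuser/llama-2-70b-int8-trtllm")
# The checkpoint directory can then be fed to trtllm-build to produce
# an engine locally, e.g.:
#   trtllm-build --checkpoint_dir <ckpt_dir> --output_dir engine/
```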

matichon-vultureprime commented 1 month ago

Uploading.

First model card!!

hchings commented 1 month ago

Hi @atyshka, the TensorRT Model Optimizer team is aware of this and similar requests. We've started planning to publish quantized checkpoints and exported models on the Hugging Face model hub.

If you have any specific requirements regarding this, please let me know.