Open atyshka opened 6 months ago

TensorRT-LLM has great potential for letting people run larger models efficiently with limited hardware resources. Unfortunately, the current quantization workflow requires significant computational resources: an int8/FP8 quant of a 70B model (~70 GB at one byte per parameter) would easily fit in 2x RTX 6000s, but performing that quantization on such hardware is impossible, as the recommended configuration for calibration is 4x A100/H100.

The only solution for those of us without this hardware is renting big instances on AWS/etc., but it seems wasteful for every user to do this individually. Could there be a "model zoo" for the repo where these quants can be stored, similar to how quants are published on Huggingface?

I understand there may be licensing concerns with hosting weights for models like Llama. Instead, could we host pre-computed activation scales that can be used to compute quantized weights without excessive GPU memory? I wouldn't see any licensing issue with such a solution.
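To make that concrete, here is a minimal sketch of what consuming published scales could look like, assuming a SmoothQuant-style int8 scheme. The function name and scale layout here are hypothetical (this is not an existing TRT-LLM or modelopt format); the point is only that, given per-channel activation scales, producing the quantized weights is cheap elementwise math that runs fine on CPU:

```python
import torch

def quantize_weight_int8(weight: torch.Tensor, act_scale: torch.Tensor):
    """Hypothetical SmoothQuant-style int8 weight quantization.

    weight:    [out_features, in_features] FP16/FP32 weight
    act_scale: [in_features] pre-computed per-channel activation scales
    """
    # Fold the activation scales into the weight, migrating quantization
    # difficulty from activations into weights (the SmoothQuant idea).
    w = weight.float() * act_scale
    # Symmetric per-output-channel int8 quantization of the smoothed weight.
    w_scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    w_q = torch.round(w / w_scale).clamp(-127, 127).to(torch.int8)
    return w_q, w_scale  # w_scale is needed again at inference time
```

None of this needs a GPU, so the expensive calibration run would only ever happen once, by whoever publishes the scales.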
Thank you for the request. We hope such tasks can be covered by the community so that we can focus on supporting important models and features.
Hi @byshiue, I agree this would be nice as an open-source collaborative effort. Unfortunately, community involvement is limited here: quantization for most formats, like FP8 and per-channel SmoothQuant, is now done in the closed-source NVIDIA modelopt tool, so there's no way I could make a PR to output activation scales. A limited subset of quantization is still open-source as part of TRT-LLM, but I wonder whether that too will be migrated to modelopt in the near future.
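For context, the flow I'm referring to looks roughly like this (a sketch using the public `nvidia-modelopt` Python API; config names and signatures may vary across versions, and `load_model`/`calib_dataloader` are placeholders):

```python
import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

model = load_model()  # placeholder: a full-precision HF/torch model

# FP8 shown here; mtq.INT8_SMOOTHQUANT_CFG covers per-channel SmoothQuant.
config = mtq.FP8_DEFAULT_CFG

def forward_loop(model):
    # Calibration pass: modelopt observes activation ranges while the
    # *unquantized* model runs, which is what demands so much GPU memory.
    for batch in calib_dataloader:  # placeholder dataloader
        model(batch)

model = mtq.quantize(model, config, forward_loop)

# The exported checkpoint bakes in exactly the activation scales that
# this issue is asking to have published separately.
export_tensorrt_llm_checkpoint(
    model, decoder_type="llama", dtype=torch.float16, export_dir="./ckpt"
)
```

Since all of that happens inside modelopt, there's no open-source surface where a community PR could change what gets emitted.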
If you think that's more appropriate, I could re-file this issue on the modelopt examples repo.
I could upload quantized weights to Huggingface, like TheBloke does. I might start this upcoming weekend.
Hi @atyshka, the TensorRT Model Optimizer team is aware of this and similar requests. We've started planning to publish quantized checkpoints and exported models on the HuggingFace model hub.
If you have any specific requirements regarding this, please let me know.
@hchings Any update on this? I haven't seen many other models published on Huggingface aside from the ones by @matichon-vultureprime.
Hi @atyshka, we have a few Llama models uploaded, like https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8, and we're uploading more (e.g., a Medusa checkpoint).
Legal clearance took a while.
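For reference, pulling one of the published checkpoints is a standard Hub download; no local calibration hardware is involved (shown with `huggingface_hub`, followed by the usual TensorRT-LLM engine build):

```python
from huggingface_hub import snapshot_download

# Fetch the pre-quantized FP8 checkpoint linked above
# (the repo may require accepting the license on the Hub first).
ckpt_dir = snapshot_download(repo_id="nvidia/Llama-3.1-405B-Instruct-FP8")
print(ckpt_dir)  # feed this directory to the usual trtllm-build step
```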