It would be great if the Llama 2 13B AWQ 4-bit quantized model currently used were upgraded to the Llama 3 8B model. It can be quantized the same way (see the sketch below). This would have several advantages:
- Llama 3 8B performs significantly better on all benchmarks.
- Being an 8B model instead of a 13B one, it could reduce the VRAM requirement from 8 GB to 6 GB, enabling popular GPUs such as the RTX 3050, RTX 3060 Laptop and RTX 4050 Laptop to run this demo.
- It would be more than 50% faster: 4-bit decoding is largely memory-bandwidth bound, so throughput scales roughly with parameter count (13B / 8B ≈ 1.6x).
The models are available at: https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6
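For reference, here is a minimal sketch of how the AWQ 4-bit quantization could be reproduced for Llama 3 8B with the AutoAWQ library. The output directory name is an assumption, and the quantization config mirrors AutoAWQ's common 4-bit, group-size-128 defaults rather than this project's exact settings:

```python
# Minimal sketch: AWQ 4-bit quantization of Llama 3 8B with AutoAWQ.
# Assumptions: AutoAWQ is installed (pip install autoawq) and access to
# the gated meta-llama repo has been granted on Hugging Face.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"
quant_path = "Meta-Llama-3-8B-Instruct-AWQ"  # hypothetical output directory

# 4-bit weights with group size 128, analogous to the current 13B AWQ model.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the fp16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run calibration and quantize the weights to 4 bits.
model.quantize(tokenizer, quant_config=quant_config)

# Persist the quantized checkpoint for reuse.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```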