epfLLM / meditron

Meditron is a suite of open-source medical Large Language Models (LLMs).
https://huggingface.co/epfl-llm
Apache License 2.0

llama.cpp Integration to Support Low-End Hardware Compatibility #11

Open efelem opened 7 months ago

efelem commented 7 months ago

Request for llama.cpp Integration to Support Low-End Hardware Compatibility

Description

I'm currently trying to integrate llama.cpp with Meditron for running models on lower-end hardware. Meditron is based on Llama, so in theory, this should be possible. However, I'm encountering issues when attempting to convert the Meditron model using llama.cpp.

Steps to Reproduce

  1. Either run python3 convert-hf-to-gguf.py ../meditron-7b/

    • Output:
      Loading model: meditron-7b
      Traceback (most recent call last):
      ...
      NotImplementedError: Architecture "LlamaForCausalLM" not supported!
  2. Or launch llama.cpp directly using:

    ./build/bin/main --rope-freq-scale 8.0 -m ../meditron-7b/pytorch_model-00008-of-00008.bin -p "I have pain in my leg from toes to hip"
    • Output:
      Log start
      ...
      error loading model: llama_model_loader: failed to load model from ../meditron-7b/pytorch_model-00008-of-00008.bin

Expected Behavior

Successful integration of llama.cpp with Meditron, allowing the model to run on lower-end hardware.

Actual Behavior

The conversion script raises a NotImplementedError for the architecture "LlamaForCausalLM", and llama.cpp fails to load the model when pointed directly at the PyTorch checkpoint shards.

Possible Solution

Adjust llama.cpp to support the "LlamaForCausalLM" architecture used by Meditron. This could involve modifying the model conversion script or the model loading mechanism in llama.cpp (see the sketch below).
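
As a possible workaround (a sketch only, not verified against the Meditron weights): in llama.cpp checkouts from that period, Llama-family Hugging Face checkpoints were handled by the legacy convert.py script rather than convert-hf-to-gguf.py, and main only loads GGUF files, not raw pytorch_model-*.bin shards. The exact script names, flags, and output paths depend on the llama.cpp revision, but the workflow would look roughly like this:

    # Convert the HF checkpoint to GGUF (convert.py handled Llama-family models)
    python3 convert.py ../meditron-7b/ --outtype f16 --outfile meditron-7b-f16.gguf

    # Optionally quantize to reduce memory use on low-end hardware (4-bit shown)
    ./build/bin/quantize meditron-7b-f16.gguf meditron-7b-q4_0.gguf q4_0

    # Run against the resulting GGUF file, not the pytorch_model-*.bin shards
    ./build/bin/main -m meditron-7b-q4_0.gguf -p "I have pain in my leg from toes to hip"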

Additional Context

Link to llama.cpp: https://github.com/ggerganov/llama.cpp

Request

I kindly request that the team consider adding support for llama.cpp integration with Meditron, or provide advice on how to implement it. This would be a significant enhancement, enabling the use of Meditron models on more diverse hardware setups, especially lower-end ones.

martinjaggi commented 7 months ago

Related: did you also try these quantized models? https://huggingface.co/TheBloke/meditron-70B-GGUF https://huggingface.co/TheBloke/meditron-7B-GGUF
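
For reference, a minimal sketch of running one of those prebuilt GGUF files with llama.cpp; the exact filename is an assumption, so check the repository's file list for the quantization variant you want:

    # Download one quantization variant (filename assumed; verify in the repo's file list)
    huggingface-cli download TheBloke/meditron-7B-GGUF meditron-7b.Q4_K_M.gguf --local-dir .

    # Run it with llama.cpp's main binary
    ./build/bin/main -m meditron-7b.Q4_K_M.gguf -p "I have pain in my leg from toes to hip"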