Closed: ivanbaldo closed this issue 2 days ago
@ivanbaldo, thank you for this idea. Perhaps specifying models via a model ID could be implemented.
This might be easier than the idea I had. I was trying to port support for quantized GGUF models from this candle example, but am a bit lost bringing it in: https://github.com/huggingface/candle/blob/main/candle-examples/examples/quantized/main.rs
It might also be an issue to know the base Llama model there so the parameters can be set correctly; I don't know if GGUF has all the info you need in the model metadata.
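For reference, a minimal sketch of what that port might look like, loosely based on the linked candle example. The exact `candle_core`/`candle_transformers` APIs and the GGUF metadata keys shown here are assumptions and may differ between candle versions; the file name is hypothetical.

```rust
// Sketch: load a quantized Llama GGUF file with candle and inspect its metadata.
// Assumes the anyhow, candle-core, and candle-transformers crates; APIs may
// differ by candle version.
use candle_core::quantized::gguf_file;
use candle_core::Device;
use candle_transformers::models::quantized_llama::ModelWeights;

fn main() -> anyhow::Result<()> {
    let path = "llama-2-7b.Q4_K_M.gguf"; // hypothetical local GGUF file
    let mut file = std::fs::File::open(path)?;

    // Parse the GGUF container: tensor infos plus a key/value metadata table.
    let content = gguf_file::Content::read(&mut file)?;

    // The metadata table usually carries the base-model hyperparameters, e.g.
    // "llama.block_count", "llama.attention.head_count", "llama.context_length",
    // which is what would be needed to set parameters without knowing the base model.
    for (key, value) in content.metadata.iter() {
        println!("{key}: {value:?}");
    }

    // Build the quantized model weights from the parsed content.
    let device = Device::Cpu;
    let _model = ModelWeights::from_gguf(content, &mut file, &device)?;
    Ok(())
}
```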
GGUF would be a great addition! However, I am now working on mistral.rs, the successor to this project: https://github.com/EricLBuehler/mistral.rs
Mistral.rs currently has quantized and normal Mistral models and can be used with arbitrary derivative models. It provides an OpenAI-compatible server, and there is a simple chat example.
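As an illustration, here is a minimal sketch of calling such an OpenAI-compatible server from Rust. Only the `/v1/chat/completions` route is implied by OpenAI compatibility; the host, port, model id, and request fields below are assumptions, not mistral.rs defaults.

```rust
// Sketch: query an OpenAI-compatible chat completions endpoint with reqwest.
// Requires reqwest with the "blocking" and "json" features, plus serde_json.
// The base URL, port, and model id below are assumptions.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "mistral7b", // hypothetical model id
        "messages": [
            { "role": "user", "content": "Say hello in one sentence." }
        ]
    });

    let response = reqwest::blocking::Client::new()
        .post("http://localhost:8080/v1/chat/completions") // assumed host/port
        .json(&body)
        .send()?
        .text()?;

    println!("{response}");
    Ok(())
}
```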
Currently the models need to be specified as `llama7b`, for example, but what if one wants to use `codellama/CodeLlama-7b-hf` or `meta-llama/Llama-2-7b-hf` (non-chat version), etc.? A more flexible method should be implemented in the future.
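One possible direction for that more flexible method is accepting an arbitrary Hugging Face model id and resolving its files at load time. A rough sketch under that assumption, using the hf-hub crate; the file names fetched here vary per repository and gated repos would also need an access token.

```rust
// Sketch: resolve an arbitrary Hugging Face model id instead of a hard-coded alias.
// Uses the hf-hub crate's sync API (and anyhow); file names differ by repository.
use hf_hub::api::sync::Api;

fn fetch_model_files(model_id: &str) -> anyhow::Result<()> {
    let api = Api::new()?;
    let repo = api.model(model_id.to_string());

    // Config and tokenizer are enough to work out the base architecture and set
    // parameters; weight file names (sharded safetensors, GGUF, ...) vary per repo.
    let config = repo.get("config.json")?;
    let tokenizer = repo.get("tokenizer.json")?;
    println!("config: {config:?}\ntokenizer: {tokenizer:?}");
    Ok(())
}

fn main() -> anyhow::Result<()> {
    // e.g. one of the repos mentioned above instead of a fixed "llama7b" alias
    // (note: meta-llama repos are gated and require a Hugging Face token)
    fetch_model_files("meta-llama/Llama-2-7b-hf")
}
```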
Please also refer to PR #46; it can load arbitrary models under the given model architecture.
@ivanbaldo closing this as we can support loading weights of arbitrary derivative models. Please feel free to reopen!