EricLBuehler / candle-vllm

Efficient platform for inference and serving local LLMs, including an OpenAI-compatible API server.
MIT License

Support using arbitrary derivative models #34

Closed · ivanbaldo closed this issue 2 days ago

ivanbaldo commented 5 months ago

Currently, models have to be specified by a fixed name such as llama7b, but what if one wants to use codellama/CodeLlama-7b-hf or meta-llama/Llama-2-7b-hf (the non-chat version), etc.? A more flexible method should be implemented in the future.

EricLBuehler commented 4 months ago

@ivanbaldo, thank you for this idea. Perhaps specifying models via a model ID could be implemented.
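A minimal sketch of that idea, assuming the hf-hub crate: resolve a Hugging Face model ID to local files and hand the paths to the existing loader. This is not candle-vllm's actual code; only the hf-hub calls are real, and `fetch_model_files` is a hypothetical helper.

```rust
use hf_hub::api::sync::Api;

// Hypothetical helper: download (or reuse from cache) the config and tokenizer
// for an arbitrary Hub model ID.
fn fetch_model_files(
    model_id: &str,
) -> Result<(std::path::PathBuf, std::path::PathBuf), Box<dyn std::error::Error>> {
    let api = Api::new()?;
    let repo = api.model(model_id.to_string());
    let config = repo.get("config.json")?;
    let tokenizer = repo.get("tokenizer.json")?;
    Ok((config, tokenizer))
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Any derivative model on the Hub could be referenced the same way.
    let (config, tokenizer) = fetch_model_files("codellama/CodeLlama-7b-hf")?;
    println!("config: {config:?}\ntokenizer: {tokenizer:?}");
    Ok(())
}
```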

h0ru5 commented 4 months ago

This might be easier than the idea I had. I was trying to port support for quantized GGUF models from this candle example, but am a bit lost bringing it in: https://github.com/huggingface/candle/blob/main/candle-examples/examples/quantized/main.rs

It might also be an issue to figure out the base Llama model in order to set the parameters correctly; I don't know whether GGUF carries all the info you need in its model metadata.
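One way to check is to dump the header with candle's gguf_file reader (the same module the linked quantized example uses) and see which hyperparameters the file actually carries. A minimal sketch; the file path is a placeholder.

```rust
use candle_core::quantized::gguf_file;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = "llama-2-7b.Q4_K_M.gguf"; // placeholder path
    let mut file = std::fs::File::open(path)?;
    let content = gguf_file::Content::read(&mut file)?;

    // List every metadata key/value pair stored in the GGUF header.
    for (key, value) in &content.metadata {
        println!("{key}: {value:?}");
    }
    // Tensor names and shapes are also available if needed.
    println!("{} tensors", content.tensor_infos.len());
    Ok(())
}
```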

EricLBuehler commented 4 months ago

GGUF would be a great addition! However, I am now working on mistral.rs, the successor to this project: https://github.com/EricLBuehler/mistral.rs

Mistral.rs currently has quantized and normal Mistral models, and may be used with arbitrary derivative models. It provides an OpenAI-compatible server, and there is a simple chat example.
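Since the server speaks the OpenAI chat-completions protocol, any OpenAI-style client should work against it. A minimal sketch of such a client, assuming reqwest (blocking + json features) and serde_json; the host, port, and model name below are placeholders, not values taken from the project.

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let body = json!({
        "model": "mistral7b", // hypothetical model name
        "messages": [
            { "role": "user", "content": "Hello!" }
        ]
    });
    // Endpoint path follows the OpenAI convention; host/port are placeholders.
    let resp = client
        .post("http://localhost:8000/v1/chat/completions")
        .json(&body)
        .send()?;
    println!("{}", resp.text()?);
    Ok(())
}
```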

guoqingbao commented 6 days ago

> Currently, models have to be specified by a fixed name such as llama7b, but what if one wants to use codellama/CodeLlama-7b-hf or meta-llama/Llama-2-7b-hf (the non-chat version), etc.? A more flexible method should be implemented in the future.

Please also refer to PR #46; it can load arbitrary models under a given model architecture.
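For illustration only (not the PR's actual code), the idea behind loading arbitrary models under a given architecture can be sketched as: read the model's config.json, inspect its "architectures" field, and dispatch to the matching pipeline. The struct and the dispatch targets below are hypothetical.

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct HubConfig {
    architectures: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("config.json")?; // placeholder path
    let config: HubConfig = serde_json::from_str(&raw)?;

    // Any derivative model that declares a supported architecture can be served.
    match config.architectures.first().map(String::as_str) {
        Some("LlamaForCausalLM") => println!("load with the Llama pipeline"),
        Some(other) => println!("unsupported architecture: {other}"),
        None => println!("config.json has no architectures field"),
    }
    Ok(())
}
```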

EricLBuehler commented 2 days ago

@ivanbaldo Closing this, as we now support loading the weights of arbitrary derivative models. Please feel free to reopen!