Hi,
I'm trying to implement a local AI for my Home Assistant instance using my 4-server KVM cluster. Unfortunately, the model I want to use is only available as GGUF files (https://huggingface.co/acon96/Home-3B-v3-GGUF) for llama.cpp. Is there any way to convert these models for Distributed Llama? I don't have access to any current GPU for that job, so I thought I could try distributing the model across my available compute cluster instead.
I'm also unsure whether the model itself is compatible at all. From what I've read, Zephyr is based on Mistral/Mixtral and should be compatible according to your README, but I'm fairly new to the AI world, and it can be quite overwhelming with all the new terms. So if the question is ridiculous, just tell me so. :)
BR, RaVoR