Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
Hi,
looking at your great project, I see that the model must be in the .bin format, but following your instructions, convert-llama.py creates a ".m" file, not a ".bin" one.
Am I missing a step?
Cheers
Hello @fabgat, as of the last few versions, Distributed Llama converts models to the .m format. Only the extension has changed; the binary format is the same. You can still use .bin models.
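Since the two formats are byte-identical, a plain rename or copy should be enough if some tool or script still expects the .bin extension. A minimal sketch (the filename below is hypothetical):

```python
import shutil

# The .m file uses the same binary layout as the old .bin format,
# so copying it under the .bin name is safe if a tool still expects
# that extension. "llama-2-7b.m" is a hypothetical filename.
shutil.copyfile("llama-2-7b.m", "llama-2-7b.bin")
```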