Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
Hello, @b4rtaz!
I'm trying to run the model elinas/Llama-3-15B-Instruct-zeroed-ft on a cluster composed of one Raspberry Pi 4B 8 GB and seven Raspberry Pi 4B 4 GB. Although I can actually run the model through Distributed Llama in inference mode, I'm getting some unexpected, strange characters (like Ċ, Ġ and D) in the model's output, as shown below. Do you know why this is happening and how to fix it?
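For what it's worth, Ċ and Ġ look like raw byte-level BPE token strings rather than corruption: GPT-2-style tokenizers (which Llama 3 inherits) map every byte to a printable Unicode character before merging, so a space becomes Ġ and a newline becomes Ċ. If the decoder prints token strings without reversing that mapping, exactly these characters appear. The sketch below (my own illustration, not code from Distributed Llama; the helper names are hypothetical) reproduces the standard mapping and its inverse:

```python
# Sketch: GPT-2-style byte-level BPE maps each byte to a printable
# Unicode char; spaces surface as 'Ġ' and newlines as 'Ċ' when token
# strings are printed without reversing the mapping.

def bytes_to_unicode():
    """The reversible byte -> printable-unicode table used by GPT-2-style BPE.

    Printable Latin-1 bytes map to themselves; all other bytes are
    shifted into code points 256+ so every token string stays printable.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # non-printable byte: remap it
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte2uni = bytes_to_unicode()
uni2byte = {v: k for k, v in byte2uni.items()}

def decode_token_string(s: str) -> str:
    """Hypothetical helper: turn a raw BPE token string back into text."""
    return bytes(uni2byte[ch] for ch in s).decode("utf-8")

# Space (byte 32) maps to 'Ġ', newline (byte 10) to 'Ċ':
print(byte2uni[32])                    # → Ġ
print(byte2uni[10])                    # → Ċ
print(decode_token_string("ĠHello"))   # → " Hello" (leading space)
```

So if these characters show up in the output, the fix is usually on the detokenization side (applying this inverse mapping) rather than in the distributed inference itself.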