Open · wzLLM opened this issue 2 months ago

Hello, thank you very much for open-sourcing such an excellent project. I have run into a problem: when I run a large model across two Mac computers, inference is no faster than on a single Mac. I would like to ask how to solve this. Thank you very much.
Currently exo does pipeline-parallel inference, which is faster than offloading when a single device can't fit the entire model. If a single device can fit the entire model, then just run it on that device -- no need for exo.
We're working on other kinds of parallelism that will improve inference speed as you add more devices.
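
To make this concrete, here is a minimal Python sketch (not exo's code or API; the layer timings and the `partition` helper are made up for illustration) of why pipeline parallelism doesn't lower single-request latency: each token still has to pass through every layer in sequence, whichever device holds it.

```python
import time

def make_layer(delay_s):
    """Stand-in for a transformer layer: sleep to simulate compute time."""
    def layer(x):
        time.sleep(delay_s)
        return x + 1
    return layer

# Toy "model": 8 layers, each taking roughly 10 ms of compute.
layers = [make_layer(0.01) for _ in range(8)]

def partition(layers, num_devices):
    """Split the layers into contiguous stages, one stage per device."""
    per_stage = (len(layers) + num_devices - 1) // num_devices
    return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]

def pipeline_forward(stages, x):
    """Run a single token through every stage in order, as one request must."""
    for stage in stages:
        for layer in stage:
            x = layer(x)
    return x

for num_devices in (1, 2):
    stages = partition(layers, num_devices)
    start = time.perf_counter()
    pipeline_forward(stages, 0)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{num_devices} device(s): ~{elapsed_ms:.0f} ms per token")
```

Both runs take roughly the same time per token: what the second device buys you in this setup is memory capacity (fitting a model a single machine can't hold), not faster decoding of a single request.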
Got it, thank you for your answer.