stewartugelow opened this issue 3 months ago
Thanks!
I have been thinking about that as well.
What are your thoughts?
Btw, distributed inference is on the roadmap.
I gave it a quick look but will dig deeper throughout the week.
It's an area I've never explored before, so the code doesn't make much sense to me yet. I can't imagine there are that many people who really need to run a Llama 70B model locally, but if it accelerated token response rates it would be amazing!
(Higher-bit quants might be very valuable, too.)
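For context, some back-of-the-envelope math on why pooling machines (and higher-bit quants) would matter; rough numbers only, since real usage adds KV cache and runtime overhead:

```python
# Rough weight-memory estimate for a 70B-parameter model at various quant levels.
# Real usage adds KV cache, activations, and runtime overhead on top of this.
PARAMS = 70e9

for bits in (4, 6, 8, 16):
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gigabytes:.0f} GB")

# ~35 GB at 4-bit fits on one 64 GB Mac; ~70 GB at 8-bit already
# wants two machines (or one 128 GB machine) once overhead is counted.
```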
Working on a custom solution for it.
It will take a bit more time because my plate is pretty full and I don't have multiple powerful machines to test bigger models.
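Roughly, the standard approach is layer sharding (pipeline parallelism): each machine holds a contiguous slice of the model's layers and forwards activations to the next. A toy sketch of that idea, illustrative only and not the actual implementation:

```python
# Toy pipeline-parallel forward pass: shard a stack of layers across "nodes"
# and pass activations from one shard to the next. In a real cluster each
# hop would be a network send; here it's just a loop.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_LAYERS, N_NODES = 64, 12, 3

# Stand-ins for transformer blocks: a stack of weight matrices.
layers = [rng.standard_normal((DIM, DIM)) / np.sqrt(DIM) for _ in range(N_LAYERS)]

# Split the stack into contiguous slices, one per node.
per_node = N_LAYERS // N_NODES
shards = [layers[i * per_node:(i + 1) * per_node] for i in range(N_NODES)]

def run_shard(shard, x):
    """One node's forward pass over its slice of the layers."""
    for w in shard:
        x = np.tanh(x @ w)  # toy nonlinearity in place of a real block
    return x

x = rng.standard_normal(DIM)
for node_id, shard in enumerate(shards):
    x = run_shard(shard, x)
    print(f"node {node_id} done, activation norm {np.linalg.norm(x):.3f}")
```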
Exo Labs just published a repository showing how to create an MLX cluster across your Apple devices. (Including iOS, apparently!)
https://github.com/exo-explore/exo
https://x.com/ac_crypto/status/1812949425465270340?s=46&t=3addBdiItmeUbNMMQJTR4w
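Per their README, once the nodes discover each other exo serves a ChatGPT-compatible API, so you can hit it with a standard chat-completions request. A minimal sketch; the port and model name are my guesses from the README, so double-check against the repo:

```python
# Poke a running exo cluster via its ChatGPT-compatible endpoint.
# Port 8000 and the model name are assumptions from the README; verify
# them against https://github.com/exo-explore/exo before running.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps({
        "model": "llama-3-70b",  # whichever model the cluster is serving
        "messages": [{"role": "user", "content": "Hello from my Mac cluster!"}],
        "temperature": 0.7,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])
```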