Closed vineel96 closed 3 months ago
Hi, thanks for the question!
Unfortunately, we do not have CPU support.
Thanks for the reply @HanGuo97. Is there any plan/possibility to extend this work to CPUs? Also, any references/links in this regard would be helpful. Thanks.
FLUTE makes a few assumptions about the hardware platform, and because of this it is designed specifically for NVIDIA GPUs. We have near-term plans to extend FLUTE to additional Ampere/Ada-generation GPUs, but CPUs are unfortunately not in our plans at the moment. We are happy to help out if you are interested in extending it to CPUs, though.
As for the references, what kind of references are you looking for?
@HanGuo97,
I think some of the "ideas" used in FLUTE might be useful for CPUs. For example, offline partitioning to reduce runtime re-ordering before hardware-accelerated matmul intrinsics --- although I'm not familiar with what kinds of intrinsics are available on CPUs. I have very little high-performance CPU programming experience, so it's a bit hard for me to judge.
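To make the offline re-ordering idea concrete, here is a minimal CPU sketch in NumPy. This is not FLUTE's actual implementation or API --- the function names, the tile-interleave permutation, and the lookup-table dequantization are all hypothetical illustrations of the general pattern: do the data re-ordering once, offline, so the runtime path reads weights in their stored order and only scatters the output back at the end.

```python
import numpy as np

def offline_pack(w_idx, tile=4):
    """Offline step (hypothetical): permute quantized weight columns into a
    tile-interleaved order once, so the runtime kernel can read them
    contiguously instead of gathering/re-ordering on every call."""
    n = w_idx.shape[1]
    # Example interleave: [0, tile, 1, tile+1, ...] -- stands in for
    # whatever layout a given CPU's SIMD/matmul path prefers.
    perm = np.arange(n).reshape(-1, tile).T.reshape(-1)
    return w_idx[:, perm], perm

def runtime_matmul(x, packed_idx, lut, perm):
    """Runtime step: dequantize via lookup table, multiply, then undo the
    offline permutation on the output columns."""
    w = lut[packed_idx]      # LUT-based dequantization of int codes
    y = x @ w                # dense matmul on the packed layout
    out = np.empty_like(y)
    out[:, perm] = y         # scatter columns back to the original order
    return out
```

The invariant worth checking is that the packed path matches a direct dequantize-then-matmul, i.e. `runtime_matmul(x, *offline_pack(w_idx), ...)` equals `x @ lut[w_idx]`; the re-ordering only changes memory layout, not the result.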
The line of work on tensor compilers (also referred to as ML compilation) could be useful. For example, this is a very good reference: https://mlc.ai/
@HanGuo97 Thanks for the insights and links, will get back to you if I have any further doubts.
Great to hear!
Hello @HanGuo97, does this fast matrix multiplication kernel also work on CPUs? Have you run any experiments on CPUs?