Open pitsianis opened 1 year ago
I often say: The choice is up to the user.
Experience has shown that having GPU backends as dependencies can cause issues when one backend is quicker to update than another.
I was looking at option 3, but I am unsure how to set it up for all KA-supported backends.
Great job, by the way. On the first try, I got a non-trivial bitonic sort to perform much better than ThreadsX.sort! on both an M2 and a Tesla P100.
Is the KA implementation expected to be 15-20% slower than the Metal version? Or am I doing something wrong?
I want to build a standalone module that can run on any supported GPU. How do I detect which packages need to be loaded so that I can have a pattern like