Open rustic-snob opened 2 months ago
Hi! Thank you for your amazing work!

I'm having some trouble comparing the fused SwiGLU kernel with its plain PyTorch version. I measured the wall-clock time with the code below, and the fused kernel runs at roughly 0.5x the speed of the PyTorch one. Did I do something wrong?

Thanks.

Thanks for the interest and for providing code. I was able to replicate this. The problem size you are using (short sequence length and large inner dimension) is different from what the kernel was tuned for: all of my testing was done with longer contexts of 2048 tokens and smaller model dimensions. You may have to experiment with adding additional autotune settings here to get better performance.
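To make the autotune suggestion above concrete, here is a hedged sketch of what widening a Triton kernel's autotune search space could look like: extra `triton.Config` entries covering short-sequence / wide-inner-dimension shapes, keyed on the problem size so tuning reruns when the shape changes. The kernel name, argument names, and block-size values are illustrative assumptions, not this repo's actual code.

```python
# Config-only sketch; the kernel body and parameter names are placeholders.
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        # a config in the spirit of the original tuning (long contexts)
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64}, num_warps=4),
        # additional configs for short-sequence / large-inner-dim problems
        triton.Config({"BLOCK_M": 16, "BLOCK_N": 256}, num_warps=8),
        triton.Config({"BLOCK_M": 32, "BLOCK_N": 128}, num_warps=4),
    ],
    key=["seq_len", "inner_dim"],  # re-tune whenever these change
)
@triton.jit
def swiglu_kernel(x_ptr, out_ptr, seq_len, inner_dim,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    ...  # kernel body unchanged
```

The `key` list is what matters here: without the problem-size arguments in it, the autotuner will reuse a configuration picked for one shape on every other shape.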
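For anyone reproducing the comparison, it helps to pin down what the unfused baseline computes. A minimal NumPy sketch of a SwiGLU forward pass, `silu(x W_gate) * (x W_up)`; the weight names and gating convention are common but are assumptions here, not this repo's API.

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x, w_gate, w_up):
    # Unfused SwiGLU: the gate projection goes through SiLU,
    # then multiplies the "up" projection elementwise.
    return silu(x @ w_gate) * (x @ w_up)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # (tokens, model_dim)
w_gate = rng.standard_normal((8, 16))  # (model_dim, inner_dim)
w_up = rng.standard_normal((8, 16))
out = swiglu(x, w_gate, w_up)          # (tokens, inner_dim)
```

A fused kernel computes the same function in one pass instead of three separate ops, which is where the expected speedup comes from; whether it materializes depends on the problem shape, as noted above.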