oscarknagg opened this issue 2 years ago
Hi there!
a. I'm using this repo's GEMM-based implementations (CutlassMLP) as a baseline, since these avoid unrelated overheads of Python frameworks.
b. Compared to that baseline, I've observed significant speedups for 64-wide and smaller MLPs, moderate speedups for 128-wide MLPs, and no speedup for 256-wide MLPs, all on an RTX 3090. I've hand-tuned the low-level kernel configurations in fully_fused_mlp.cu for each of them, so I'm reasonably confident in this; those configurations need to be tuned to whichever sizes are available. (The configs I'm comparing are sketched below.)
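For concreteness, this is the shape of the configs I'm comparing, as a minimal sketch: the JSON fields (`otype`, `activation`, `n_neurons`, `n_hidden_layers`) are the documented tiny-cuda-nn network schema, while the little harness around them is hypothetical.

```cpp
#include <json/json.hpp>  // nlohmann::json, as bundled with tiny-cuda-nn
#include <cstdio>
#include <string>

// Identical architectures that differ only in "otype", so any timing gap
// comes from the kernels themselves rather than the network shape.
nlohmann::json mlp_config(const std::string& otype, int width, int n_hidden_layers) {
    return {
        {"otype", otype},                   // "FullyFusedMLP" or "CutlassMLP"
        {"activation", "ReLU"},
        {"output_activation", "None"},
        {"n_neurons", width},               // MLP width under test
        {"n_hidden_layers", n_hidden_layers}
    };
}

int main() {
    // e.g. a 64-wide pair; each config would be handed to the network
    // factory and timed at the same batch size.
    std::printf("%s\n", mlp_config("CutlassMLP", 64, 4).dump(2).c_str());
    std::printf("%s\n", mlp_config("FullyFusedMLP", 64, 4).dump(2).c_str());
}
```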
For reference, here are the specs of an A100 vs a 3090 (per NVIDIA's spec sheets):

| | A100 (SXM4) | RTX 3090 |
| --- | --- | --- |
| SMs | 108 | 82 |
| Max shared memory per SM | 164 KB | 100 KB |
| Register file per SM | 256 KB | 256 KB |
| L2 cache | 40 MB | 6 MB |
| Memory bandwidth | 1555 GB/s | 936 GB/s |
These numbers are quite similar on a per-SM basis, although the A100 has significantly more SMs. Do you think this would make much difference, provided the kernel parameters are tuned appropriately?
The speedup in this repo relies on keeping memory traffic close to the chip, i.e. in caches, shared memory, and registers. That is going to stop working once an MLP is sufficiently large, but I'm unclear where the boundary is.
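To make "sufficiently large" a bit more concrete, here's the back-of-envelope arithmetic I've been using. The hardware numbers are assumptions taken from published GA102 figures rather than measurements, and the traffic model is deliberately crude (weights assumed cached for the unfused case; the fp16 weight matrix used as a rough proxy for the fused kernel's on-chip working set):

```cpp
#include <cstdio>

int main() {
    // Assumed RTX 3090 figures: dense-fp16 tensor-core peak and memory bandwidth.
    const double peak_flops  = 142e12;                   // FLOP/s
    const double bandwidth   = 936e9;                    // bytes/s
    const double ridge       = peak_flops / bandwidth;   // ~152 FLOP/byte
    const double smem_per_sm = 100e3;                    // usable shared memory per GA102 SM

    const int widths[] = {16, 32, 64, 128, 256};
    for (int w : widths) {
        // One unfused W x W fp16 layer at batch B: 2*B*W^2 FLOPs against
        // 4*B*W bytes of activation traffic (read input, write output),
        // so arithmetic intensity = W/2 FLOP/byte; B cancels out.
        const double intensity = w / 2.0;
        // Rough on-chip working set for the fused path: the fp16 weight matrix.
        const double weight_kb = 2.0 * w * w / 1e3;
        std::printf("W=%3d: %5.1f FLOP/byte (ridge ~%.0f), weights %5.1f KB (smem/SM ~%.0f KB)\n",
                    w, intensity, ridge, weight_kb, smem_per_sm / 1e3);
    }
    return 0;
}
```

By this estimate a 64-wide layer sits at ~32 FLOP/byte, far below the ~152 FLOP/byte ridge point, so the unfused version is heavily memory-bound and fusion has a lot of headroom; 128-wide sits at ~64; 256-wide sits at ~128, close to the ridge, and its ~131 KB fp16 weight matrix alone exceeds an SM's shared memory, which would be consistent with the speedups above. Happy to be corrected on any of this.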
Does anyone know the answers to these questions?
I could potentially help out with testing (3).