🚀 Feature
cuDNN provides flexible, performant support for gemm/conv with fp8 quantization. If Thunder introduces fp8 casts into its traces, it can benefit from cuDNN fusions.
Motivation
Today, Thunder uses TransformerEngine's (TE) fp8 linear, which delegates the quantization strategy to TE and makes it opaque to Thunder. If Thunder is to handle fp8 casts itself, cuDNN's performant and flexible kernels can help.
cuDNN's support is described in the cuDNN runtime fusion engine documentation.
For fp8 specifically, cuDNN can provide a dequantize → gemm/conv → amax → quantize graph as one fused kernel.
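A minimal PyTorch sketch of that pattern, with every intermediate materialized for clarity (the scale/descale tensors and the scaling bookkeeping are illustrative assumptions; cuDNN computes the whole region in one kernel):

```python
import torch

FP8 = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8).max

def quantize_fp8(x, scale):
    # Scale into the fp8 representable range, clamp to avoid overflow, then cast.
    return (x * scale).clamp(-FP8_MAX, FP8_MAX).to(FP8)

def fused_fp8_matmul_reference(a_fp8, b_fp8, descale_a, descale_b, scale_out):
    # cuDNN can execute this entire region as one fused kernel; this
    # eager reference materializes every intermediate for clarity.
    a = a_fp8.to(torch.float32) * descale_a  # dequantize A
    b = b_fp8.to(torch.float32) * descale_b  # dequantize B
    out = a @ b                              # gemm in higher precision
    amax = out.abs().amax()                  # amax feeds the next scale update
    return quantize_fp8(out, scale_out), amax
```

In delayed-scaling recipes, `scale_out` is derived from the amax history of previous iterations, which is exactly the bookkeeping that TE manages today.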
The graph is flexible: the pointwise operations around the gemm/conv and their data types can vary. The corresponding backward graphs are also supported (though they require an offline transpose on Hopper).
Pitch
Have the cudnn executor claim gemm/conv operations together with the fp8 casts around them.
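A rough sketch of the shape this could take with Thunder's `OperatorExecutor`/`register_implementation` extension point (the executor name, the meta/impl functions, and the eager fallback below are all hypothetical; a real implementation would dispatch to cuDNN and would also need a fusion-style pass to claim the surrounding casts, not just the gemm):

```python
import torch
import thunder.torch as ltorch
from thunder.core.proxies import TensorProxy
from thunder.extend import OperatorExecutor, register_executor

# Hypothetical executor; registered so Thunder's transforms can use it.
cudnn_fp8_ex = OperatorExecutor("cudnn_fp8_ex", version="0.1")
register_executor(cudnn_fp8_ex)

def _fp8_linear_meta(a: TensorProxy, w: TensorProxy, bias=None) -> TensorProxy:
    # Trace-time bookkeeping only: the output shape of a linear layer.
    return TensorProxy(like=a, shape=a.shape[:-1] + (w.shape[0],))

def _fp8_linear_impl(a, w, bias=None):
    # Placeholder standing in for a cuDNN fp8 fused gemm call.
    return torch.nn.functional.linear(a, w, bias)

cudnn_fp8_linear = cudnn_fp8_ex.register_operator(
    "cudnn_fp8_linear", like=_fp8_linear_meta, fn=_fp8_linear_impl
)

def _linear_checker(a, w, bias=None):
    # A real checker would test device, dtypes, and shapes against what
    # the cuDNN fp8 kernels support; claim everything for the sketch.
    return True

def _linear_transform(a, w, bias=None):
    return cudnn_fp8_linear(a, w, bias)

cudnn_fp8_ex.register_implementation(
    ltorch.linear, checker=_linear_checker, execution_transform=_linear_transform
)
```

Claiming the casts is the interesting part: because Thunder traces make the quantize/dequantize ops explicit, a pattern-matching pass (in the spirit of how the nvFuser fusion executor claims regions) could hand the whole dequantize → gemm → quantize subgraph to one cuDNN runtime-fusion kernel.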
CC @IvanYashchuk @kshitij12345 @Anerudhan
Thank you for creating this issue! I think it should be possible to use TransformerEngine as the scaling-recipe manager and cuDNN as the performance driver in Thunder.