NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

Shouldn't the memory consumption drop when using fp8? #1261

Open JayC1208 opened 3 days ago

JayC1208 commented 3 days ago

Hi, I am just trying the example provided (https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/te_llama/tutorial_accelerate_hf_llama_with_te.html) with the Llama 2 model.

As it is a 7B model, I assume the GPU memory usage for the model should be around 14 GB when using fp16 (which is the default) and around 7 GB when using fp8. However, it still shows a memory usage of 14 GB (I used model.get_memory_footprint() and nvidia-smi to check the allocated memory). Also, when I print out the dtype of the layers' hidden states, it shows bfloat16.
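For reference, the check is roughly the following (a sketch, not the exact script; `model` is the TE-converted Llama 2 7B from the tutorial, the input is a dummy batch, and I'm assuming the wrapped model still accepts the usual HF forward arguments):

```python
import torch

# Parameter footprint reported by HF (~14 GB here, not ~7 GB).
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")

# Device memory allocated by PyTorch (nvidia-smi shows a somewhat larger
# number because it also counts the caching allocator's reserved memory).
print(f"{torch.cuda.memory_allocated() / 1e9:.1f} GB")

# Dtype of the per-layer hidden states: prints torch.bfloat16.
input_ids = torch.randint(0, model.config.vocab_size, (1, 16), device="cuda")
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)
print(out.hidden_states[1].dtype)
```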

Is this normal, or is it not working correctly on my side? Please correct me if I misunderstand something.

Thanks.

timmoon10 commented 3 days ago

Your memory usage is expected, but it also depends on your workflow. First let's talk about FP8 weights. One limitation of FP8 support in Hopper and Lovelace is that the Tensor Cores only support TN GEMMs for FP8 (see the cuBLAS docs). This is fine for the forward pass since that's the native format for torch.nn.Linear, but it means that we require transposes for the backward pass. Our default behavior when casting to FP8 is to use a fused cast-transpose kernel, which has better performance at the expense of having the same memory footprint as FP16/BF16.

If you are performing inference and you know you will not require any backpropagation, then you can initialize the model within a torch.no_grad context and TE will not allocate memory for the transposes. You can also do this if you are willing to accept the performance penalty from unfused transpose kernels. We are aware this can be unintuitive, so we're working on nicer ways to specify this within the FP8 recipe. Note that these considerations may change with upcoming hardware and low-precision formats.
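Roughly, the inference-only pattern looks like the following minimal sketch (a standalone te.Linear instead of the full Llama model; the shapes, dtype, and recipe settings here are illustrative rather than taken from the tutorial):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Build the module under no_grad so backward-only state (e.g. the transposed
# FP8 weight copies mentioned above) does not need to be kept around.
with torch.no_grad():
    layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()

# Delayed-scaling FP8 recipe; HYBRID uses E4M3 for forward, E5M2 for backward.
recipe = DelayedScaling(fp8_format=Format.HYBRID)

x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# Run inference with FP8 GEMMs enabled.
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
```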

Next, activation tensors are usually not returned in FP8. We mostly use FP8 as GEMM inputs to take advantage of Tensor Cores, and we're generally cautious about using it for more numerically sensitive operations (e.g. activation functions, LayerNorm, GEMM outputs). Also, the internals of TE modules are somewhat messy, especially how they handle the per-tensor scaling factors needed for FP8 training. If you would like to manually access FP8 values, it might be worth looking into the experimental operation-based API (see the Quantize op from https://github.com/NVIDIA/TransformerEngine/pull/1033), which makes use of a Float8Tensor class that mimics a plain PyTorch tensor.
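For what it's worth, the default behavior is easy to see by continuing the sketch from the previous comment (same `layer` and `y` as above):

```python
# The weights are still stored in BF16 (hence the ~2 bytes/parameter
# footprint), and the returned activation is BF16 as well, even though the
# GEMM itself consumed FP8 inputs internally.
print(layer.weight.dtype)  # torch.bfloat16
print(y.dtype)             # torch.bfloat16
```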