-
**Your question**
I'm trying to train GPT/LLaMA on top of Megatron-LM, but I'm confused about FP8 performance.
Setting the FP8 format parameters together with "--bf16" performs much better than the situation witho…
-
FP8 is very useful for both training and inference of LLMs. Does FlashAttention support FP8?
Thank you~
-
**We see that for FP8 GEMM only TNN is supported in the cutlass_profiler-generated kernels, and in the CUTLASS examples directory as well. Are there any FP8 kernels with other layouts like TTT/TTN shipp…
-
I get this error message if I set max_len to 300, or anything higher than 100 for that matter, whenever I try to train with FP8. I'm using cuda-12.4.0-2 and the nightly CUDA 12.4 PyTorch builds an…
-
There is a use_fp flag for the offline_quantize tool in saxml/tool to quantize the weights to fp8, but they still have to be stored as int8 (https://github.com/google/praxis/blob/3f4cbb4bcda366db7b018695fbe2d4…
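For context, keeping fp8 values in an int8 buffer is just a byte-level reinterpretation, since both types are one byte wide. A minimal NumPy sketch, assuming `ml_dtypes` for the e4m3 type (this is an illustration, not the linked tool's actual code path):
```python
import numpy as np
import ml_dtypes  # provides float8_e4m3fn as a NumPy dtype

# Quantize float32 weights to fp8 (e4m3), then reinterpret the same bytes as int8
# so they can live in an int8-typed checkpoint buffer.
w = np.random.randn(4, 4).astype(np.float32)
w_fp8 = w.astype(ml_dtypes.float8_e4m3fn)   # real fp8 values, 1 byte each
w_int8_storage = w_fp8.view(np.int8)        # same bits, int8 container

# To use the weights again, view the int8 buffer back as fp8 and upcast.
w_restored = w_int8_storage.view(ml_dtypes.float8_e4m3fn).astype(np.float32)
```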
-
Hi, how do I cast a float/bfloat16 tensor to fp8? I want to do W8A8 (fp8) quantization, but I didn't find an example of quantizing activations to the FP8 format.
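A minimal PyTorch sketch of per-tensor quantization to fp8, assuming `torch.float8_e4m3fn` is available (PyTorch 2.1 or later); the scaling recipe here is illustrative, not the only option:
```python
import torch

def quantize_to_fp8(x: torch.Tensor):
    """Per-tensor symmetric quantization of a float tensor to fp8 (e4m3)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max          # 448 for e4m3
    scale = x.abs().amax().clamp(min=1e-12) / fp8_max       # per-tensor scale
    x_fp8 = (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_fp8, scale                                      # keep scale for dequant

x = torch.randn(16, 32, dtype=torch.bfloat16)
x_fp8, scale = quantize_to_fp8(x.float())
x_dequant = x_fp8.to(torch.float32) * scale                  # approximate original
```
For W8A8, the same scheme would be applied offline to the weights and on the fly to the activations, with both scales carried into the matmul.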
-
### Feature request
I see that release version 1.12 supports FP8, but I didn't see any example code for how to train an LLM using FP8.
How can I use FP8 to train a model?
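For reference, a minimal sketch of what FP8 training of a single layer typically looks like with NVIDIA Transformer Engine's PyTorch API; whether that is the integration release 1.12 refers to is an assumption on my part:
```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Delayed-scaling FP8 recipe: e4m3 for the forward pass, e5m2 for gradients.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

layer = te.Linear(1024, 1024, bias=True).cuda()
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)

x = torch.randn(32, 1024, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)            # GEMM runs in FP8; master weights stay high precision

loss = out.float().pow(2).mean()  # loss/backward outside the fp8 context
loss.backward()
optimizer.step()
```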
### Motivation
I want t…
-
### System Info
GPU - A10
### Who can help?
@Tracin
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Tasks
- [X] An officially supported task in the `…
-
Since Ada GPUs like the 4090 limit FP8 arithmetic to `fp32` accumulation, it only achieves the same max `TFLOPs` as `fp16xfp16` with `fp16` accumulation.
Furthermore, according to my test,…
-
Hi again,
I've successfully quantized an ONNX model to int8, then converted it to a TensorRT engine and noticed the performance increase compared to fp16.
```bash
python -m modelopt.onnx.quantizati…