-
## 🚀 Feature
cuDNN provides flexible support for performant GEMM/conv kernels with FP8 quantization. Since Thunder introduces FP8 casts in its traces, it can benefit from cuDNN fusions.
### Motivation
Today, thu…
-
Hi, I am trying to use FP8 with TransformerEngine. I am using a version of the GPT-NeoX repo, which uses DeepSpeed.
I can get FP8 to run in my MLPs with model parallelism, but when I use pipeline paralle…
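Not the original poster's code, but a minimal sketch of the knob that usually matters for multi-GPU FP8 in Transformer Engine: the `fp8_group` argument of `fp8_autocast`, which sets the process group over which FP8 amax statistics are reduced. The recipe values and the use of `dist.group.WORLD` are illustrative assumptions; picking the right group under DeepSpeed pipeline parallelism is exactly the open question here.
```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Launch with torchrun so the process-group env vars are set.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

layer = te.Linear(512, 512)  # TE modules default to the CUDA device
x = torch.randn(8, 512, device="cuda")
recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

# fp8_group controls which ranks reduce FP8 amax statistics together;
# dist.group.WORLD is only an illustrative default here.
with te.fp8_autocast(enabled=True, fp8_recipe=recipe,
                     fp8_group=dist.group.WORLD):
    y = layer(x)
y.sum().backward()  # backward runs outside the autocast context
```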
-
### Your question
I got:
```
Total VRAM 8188 MB, total RAM 16011 MB
pytorch version: 2.3.1+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4060 Laptop GPU : cudaMallocAsync
…
```
-
Hi, how do I cast a float/bfloat16 tensor to FP8? I want to do W8A8 (FP8) quantization, but I didn't find an example of quantizing activations to the FP8 format.
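For reference, a minimal per-tensor sketch in plain PyTorch (>= 2.1, which exposes `torch.float8_e4m3fn`); the max-based scaling scheme is an illustrative choice, not a recommendation:
```python
import torch

# Minimal per-tensor FP8 (E4M3) quantization sketch.
def quantize_fp8(x: torch.Tensor):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
    scale = x.abs().max().clamp(min=1e-12).float() / fp8_max
    x_fp8 = (x.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_fp8, scale  # keep the scale to dequantize later

x = torch.randn(4, 4, dtype=torch.bfloat16)
x_fp8, scale = quantize_fp8(x)
x_deq = x_fp8.float() * scale  # approximate reconstruction
```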
-
Since Ada GPUs like the 4090 limit FP8 arithmetic to `fp32` accumulation, it only achieves the same peak `TFLOPs` as `fp16xfp16` with `fp16` accumulation.
Furthermore, according to my test,…
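For anyone who wants to reproduce this comparison, here is a rough micro-benchmark sketch. It relies on `torch._scaled_mm`, a private PyTorch API whose signature has changed between releases, so the call below (written against roughly PyTorch 2.4) is an assumption, and it needs an FP8-capable GPU (compute capability >= 8.9, e.g. Ada/4090):
```python
import time
import torch

n = 4096
a16 = torch.randn(n, n, device="cuda", dtype=torch.float16)
b16 = torch.randn(n, n, device="cuda", dtype=torch.float16)

a8 = a16.to(torch.float8_e4m3fn)
b8 = b16.to(torch.float8_e4m3fn).t().contiguous().t()  # mat2 must be column-major
one = torch.ones((), device="cuda")  # unit scales, float32

def tflops(fn, iters=50):
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return 2 * n**3 * iters / (time.time() - t0) / 1e12

# Default fp16 matmul (cuBLAS accumulates in fp32 here) vs the FP8 path
# with fp32 accumulation (use_fast_accum is left at its False default).
print("fp16:", tflops(lambda: a16 @ b16))
print("fp8 :", tflops(lambda: torch._scaled_mm(
    a8, b8, scale_a=one, scale_b=one, out_dtype=torch.float16)))
```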
-
Is there a way to run these models with 12 GB of RAM?
With fp8 models it works, but with GGUF models it always fails.
-
[context_flashattention_nopad_fp16_fp8.txt](https://github.com/user-attachments/files/16421521/context_flashattention_nopad_fp16_fp8.txt)
We have implemented an fp8 version of context_flashattention_…
-
### System Info
```shell
Optimum-habana v1.13.2
HL-SMI: hl-1.17.1-fw-51.5.0
Driver: 1.17.1-78932ae
```
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Tasks…
-
### Feature request
I see that release version 1.12 supports fp8, but I didn't see any example code for training an LLM with FP8.
How can I use FP8 to train a model?
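The excerpt doesn't identify which project version 1.12 refers to, so as one concrete possibility only: NVIDIA Transformer Engine (discussed in an earlier item) wraps forward passes in `fp8_autocast` for FP8 training. A minimal sketch, with recipe values that are illustrative assumptions rather than tuned settings:
```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Minimal FP8 training step with Transformer Engine (illustrative values).
model = te.Linear(1024, 1024)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

x = torch.randn(32, 1024, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    loss = model(x).float().pow(2).mean()  # dummy loss
loss.backward()  # backward runs outside the autocast context
optimizer.step()
```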
### Motivation
I want t…
-
First of all, thanks for an amazing project! This runs on average 30% faster than Flux on ComfyUI. I was wondering if there's any planned support for different schedulers and samplers, like how you can…