-
### Your question
I got:
Total VRAM 8188 MB, total RAM 16011 MB
pytorch version: 2.3.1+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4060 Laptop GPU : cudaMallocAsync
…
-
Hi, how do I cast a float/bfloat16 tensor to fp8? I want to perform W8A8 (fp8) quantization, but I couldn't find an example of quantizing activations to the FP8 format.
-
### System Info
```shell
Optimum-habana v1.13.2
HL-SMI: hl-1.17.1-fw-51.5.0
Driver: 1.17.1-78932ae
```
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Tasks…
-
The CUDA extended floating point types [`__half`](https://docs.nvidia.com/cuda/cuda-math-api/struct____half.html#struct____half) and [`__nv_bfloat16`](https://docs.nvidia.com/cuda/cuda-math-api/struct…
-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N…
-
### 🚀 The feature, motivation and pitch
3D fp8 matrix multiplication can be useful for fp8 model with 3D matmul (it also can be used to improve accuracy of models with 2D fp8 quantized matrix multi…
-
`Flux diffusion model implementation using quantized fp8 matmul & remaining layers use faster half precision accumulate, which is ~2x faster on consumer devices.`
Hello there!
Thanks for sharing you…
-
Hello team,
we have been debugging large-scale training instabilities with FP8 and noticed that these started when updating from transformer-engine v1.2.1 to v1.7. Taking a closer look at the traini…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch…
-
```
Some parameters are on the meta device device because they were offloaded to the cpu.
Quantizing weights: 0%| | 0/1771 [00:00