-
Generating an image in fp8 mode reports the error:
Error occurred when executing T5TextEncode #ELLA:
"index_select_cuda" not implemented for 'Float8_e5m2'
-
### System Info
CPU architecture: x86_64
Host RAM: 1TB
GPU: 8xH100 SXM
Container: manually built from Dockerfile.trt_llm_backend with TRT 9.3
TensorRT-LLM version: 0.10.0.dev2024043000
Dr…
-
Describe the issue:
I set the weight/activation type to QuantType.QFLOAT8E4M3FN when calling quantize_static, but I get the following errors:
```
Traceback (most recent call last):
File "/home/developer/wor…
-
I want to set tp size = 2 and the global world size = 2.
The code is:
```
import os
import sys
import subprocess
import argparse
import torch
import torch.distributed as dist
import…
```
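For comparison, a minimal sketch of a 2-way tensor-parallel setup in plain torch.distributed where tp size equals the world size; launch with torchrun, and the column-parallel matmul is purely illustrative:
```
# Launch with: torchrun --nproc_per_node=2 tp_demo.py
import torch
import torch.distributed as dist

dist.init_process_group("nccl")          # world_size=2 comes from torchrun
rank = dist.get_rank()
torch.cuda.set_device(rank)

TP_SIZE = 2                              # tensor-parallel degree == world size
tp_group = dist.new_group(ranks=list(range(TP_SIZE)))

# Column-parallel linear: every rank sees the same input and owns
# one slice of the output columns of the weight.
torch.manual_seed(0)                     # same data generated on every rank
x = torch.randn(8, 1024, device="cuda")
w_full = torch.randn(1024, 1024, device="cuda")
w_shard = w_full[:, rank * 512 : (rank + 1) * 512]   # this rank's columns
local = x @ w_shard

# Gather the shards so every rank ends up with the full (8, 1024) output.
parts = [torch.empty_like(local) for _ in range(TP_SIZE)]
dist.all_gather(parts, local, group=tp_group)
y = torch.cat(parts, dim=-1)

dist.destroy_process_group()
```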
-
Loss is NaN in the very first training batch when the transformer architecture uses [rotary embedding](https://github.com/lucidrains/rotary-embedding-torch).
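For reference, the library's usual wiring looks like the sketch below (shapes illustrative). One common thing to check with first-batch NaNs is whether the rotary math runs in fp16, though that is only an assumption here:
```
import torch
from rotary_embedding_torch import RotaryEmbedding

rotary = RotaryEmbedding(dim=32)     # rotates half of a 64-dim head
q = torch.randn(1, 8, 1024, 64)      # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 1024, 64)

# Rotations are applied to queries and keys before the attention dot product.
q = rotary.rotate_queries_or_keys(q)
k = rotary.rotate_queries_or_keys(k)
```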
-
### System Info
- GPU name: L40s
- CUDA: 12.1
```
Wed Jun 5 16:27:21 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 …
```
-
### 🚀 The feature, motivation and pitch
I found that some kernels use 32-bit integers as indices, which can easily overflow. I think changing them to int64_t (or another 64-bit type) would be safer…
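To illustrate the failure mode, a PyTorch-level sketch of the same wraparound (not the kernel code itself):
```
import torch

INT32_MAX = 2**31 - 1  # largest flat offset a 32-bit index can address

i = torch.tensor([INT32_MAX], dtype=torch.int32)
print(i + 1)                  # tensor([-2147483648], dtype=torch.int32): silent wraparound
print(i.to(torch.int64) + 1)  # tensor([2147483648]): correct once widened *before* the add
```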
-
ENV: 8x RTX 4090
I want to test FP8 inference with TransformerEngine on Llama 3 (from Hugging Face), but I cannot find any instructions for inference. Can you share some code?
Thank you~
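There is no single official Llama 3 inference recipe that I know of, but the usual Transformer Engine pattern is to swap in te.Linear (or te.TransformerLayer) modules and run the forward pass under fp8_autocast. A minimal sketch with illustrative shapes, not Llama 3 itself:
```
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling recipe: HYBRID = E4M3 forward, E5M2 backward;
# only the forward format matters for inference.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

# Stand-in for a model layer; in practice the model's nn.Linear layers
# would be replaced with te.Linear (dims must be multiples of 16 for FP8).
layer = te.Linear(4096, 4096, bias=False, params_dtype=torch.bfloat16).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```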
-
Hi! Are there plans to add an FP8 Transformer Engine (H100) speedup to inference?
If not, could you please give me an outline of what needs to be done in order for me to work on that?
Thank you!
-
SOTA (cuBLAS, CUTLASS) FP8 GEMM kernels perform poorly in the small-M regime (M = bs*seq_len < 32).
This work will focus on leveraging the performant pieces of the [Marlin](https://github.com/IST-D…
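The small-M problem follows from arithmetic intensity: at M < 32 the GEMM is dominated by reading the weight matrix, so big-tile cuBLAS/CUTLASS kernels sit memory-bound. A rough back-of-envelope sketch (byte counts simplified, fp8 output assumed):
```
def intensity(M, K, N):
    """FLOPs per byte for a (M,K) x (K,N) GEMM with 1-byte (fp8) operands."""
    flops = 2 * M * K * N
    bytes_moved = M * K + K * N + M * N   # A + B + C, one byte per element
    return flops / bytes_moved

for M in (1, 16, 32, 4096):
    print(f"M={M:5d}  ~{intensity(M, 4096, 4096):7.1f} FLOPs/byte")
```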