-
### 🚀 The feature, motivation and pitch
Sharing a repro for @bdhirsh, @tugsbayasgalan on the gaps of torch.compile for FSDP2 fp8 all-gather.
For FSDP2 fp8 all-gather, it's critical to pre-compute ama…
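For context, a minimal sketch of what "pre-computing" the amax/scale for an fp8 all-gather roughly involves, assuming per-tensor scaling over the FSDP sharding group; the function names and the clamp value are illustrative, not the torchao/float8 implementation:

```python
import torch
import torch.distributed as dist

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def precompute_fp8_scale(local_shard: torch.Tensor, group=None) -> torch.Tensor:
    """Compute a single fp8 scale that every rank in the FSDP group agrees on."""
    amax = local_shard.abs().max().float()
    # Every rank must use the same scale before casting, otherwise the
    # all-gathered fp8 shards cannot be decoded with one per-tensor scale.
    dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=group)
    return E4M3_MAX / torch.clamp(amax, min=1e-12)

def cast_shard_for_all_gather(local_shard: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Cast the local shard to fp8 using the pre-computed scale, ready to all-gather."""
    return (local_shard.float() * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
```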
-
### Describe the issue
When I use gemm_float8 with input A (fp8 e5m2) and input B (fp8 e4m3), it cannot run, but with input A (fp8 e4m3) and input B (fp8 e4m3) it runs correctly.
### To reproduce
run gemm_floa…
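Since the original repro command is truncated above, here is a hedged, analogous check in plain PyTorch using `torch._scaled_mm` as a stand-in (not the `gemm_float8` entry point itself); the shapes, scales, and transpose convention are illustrative, and whether the e5m2 × e4m3 pairing is accepted depends on the backend's supported fp8 dtype combinations:

```python
import torch

M, K, N = 64, 64, 64  # multiples of 16, as fp8 GEMMs typically require
a_hp = torch.randn(M, K, device="cuda")
b_hp = torch.randn(N, K, device="cuda")  # transposed below to get a column-major operand

scale_a = torch.tensor(1.0, device="cuda")
scale_b = torch.tensor(1.0, device="cuda")

# Case 1: A in e4m3, B in e4m3 -- reported to work.
a = a_hp.to(torch.float8_e4m3fn)
b = b_hp.to(torch.float8_e4m3fn)
out = torch._scaled_mm(a, b.t(), scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)

# Case 2: A in e5m2, B in e4m3 -- the reported failing combination.
a = a_hp.to(torch.float8_e5m2)
out = torch._scaled_mm(a, b.t(), scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
```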
-
I have seen that AutoFP8-quantized models from Hugging Face, especially Mixtral-8x7B-FP8, are supported by vLLM. I am wondering whether models with both the kv_cache and the weights quantized by AutoFP8 are …
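For reference, a minimal sketch of loading such a checkpoint with vLLM's offline `LLM` API, combining fp8 weight quantization with an fp8 KV cache; the model path is a placeholder for the AutoFP8 checkpoint mentioned above, and whether both work together for a given model is exactly the open question:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Mixtral-8x7B-FP8",   # local path or Hub id of the AutoFP8 checkpoint
    quantization="fp8",         # fp8 weight quantization
    kv_cache_dtype="fp8",       # fp8 KV cache on top of the fp8 weights
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```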
-
### Env
- Inside docker, `nvcr.io/nvidia/pytorch:24.06-py3`
- L20 GPU, Driver Version: 550.90.07, CUDA Version: 12.4
- TensorRT 10.1.0
### Steps
1. Make plugins and copy the `plugins` folder…
-
# 🚀 Feature
FP8 is very useful for LLM training and inference. Does xformers support FP8?
Thank you~
-
https://github.com/huggingface/text-generation-inference/blob/d0225b10156320f294647ac676c130d03626473d/server/text_generation_server/layers/fp8.py#L4
@Narsil what do you think about enabling torch.…
-
**Your question**
I'm trying to train GPT/LLaMA on top of Megatron-LM, but I'm confused about FP8 performance.
Setting the FP8 format parameters together with `--bf16` is much better than the situation witho…
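For background, a minimal sketch of the mechanism behind Megatron-LM's FP8 flags, assuming TransformerEngine's PyTorch API: FP8 only replaces the GEMMs inside TE modules, while the surrounding parameters and activations stay in bf16, which is why the FP8 options are normally combined with `--bf16`. The recipe values below are illustrative, not Megatron-LM's exact defaults:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# "Hybrid" delayed scaling: e4m3 in the forward pass, e5m2 for gradients.
recipe = DelayedScaling(fp8_format=Format.HYBRID,
                        amax_history_len=1024,
                        amax_compute_algo="max")

layer = te.Linear(4096, 4096, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)  # the GEMM runs in fp8; inputs, outputs, and params stay bf16
y.sum().backward()
```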
-
### Feature request
Hi!
Could anyone please help me with using Hugging Face models (LLaMA [or, if LLaMA is difficult, MPT-7B]) with TransformerEngine (TE) FP8 inference? We really need the speedup
…
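One hedged sketch of how people typically wire a Hugging Face model into TE FP8 inference: swap `nn.Linear` modules for `te.Linear` (copying the weights), then run a forward pass under `fp8_autocast`. This is an illustration under assumptions, not an officially supported integration; the model id and the helper function are hypothetical, and FP8 execution imposes shape constraints (token count divisible by 8, hidden sizes divisible by 16):

```python
import torch
import torch.nn as nn
import transformer_engine.pytorch as te
from transformers import AutoModelForCausalLM, AutoTokenizer

def swap_linear_for_te(module: nn.Module) -> None:
    """Recursively replace nn.Linear submodules with te.Linear, copying weights."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            te_linear = te.Linear(child.in_features, child.out_features,
                                  bias=child.bias is not None,
                                  params_dtype=torch.bfloat16)
            with torch.no_grad():
                te_linear.weight.copy_(child.weight)
                if child.bias is not None:
                    te_linear.bias.copy_(child.bias)
            setattr(module, name, te_linear)
        else:
            swap_linear_for_te(child)

tok = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")          # illustrative model id
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b",
                                             torch_dtype=torch.bfloat16)
swap_linear_for_te(model)
model.cuda()

# 16 tokens keeps the flattened token dimension divisible by 8 for fp8 GEMMs.
ids = torch.randint(0, tok.vocab_size, (1, 16), device="cuda")
with torch.no_grad(), te.fp8_autocast(enabled=True):
    logits = model(ids).logits
print(logits.shape)
```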
-
FP8 is very useful for LLM training and inference. Does flash attention support FP8?
Thank you~
-
**We see that for FP8 GEMM only TNN is supported in the cutlass_profiler-generated kernels, and in the CUTLASS examples directory as well. Are there any FP8 kernels with other layouts, like TTT/TTN, shipp…