-
Hi experts, I tried to use Transformer Engine to run an inference test of Llama 2 on an H800 and found that FP8 was much slower than FP16. Below is a small reproduction that only contains `L…
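For reference, since the repro above is cut off, here is a minimal sketch of the kind of comparison being described: a single `te.Linear` layer timed in plain FP16 and again under `te.fp8_autocast`. The shapes, the DelayedScaling recipe, and the timing loop are illustrative assumptions, not the original code:
```python
# Illustrative sketch only (not the original, truncated repro): time one Transformer
# Engine Linear layer forward pass in FP16 vs. FP8 on an H800.
import time

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe


def bench(fn, iters=100, warmup=10):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # ms per call


layer = te.Linear(4096, 4096, bias=False, params_dtype=torch.float16).cuda()
x = torch.randn(32, 4096, device="cuda", dtype=torch.float16)
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)


@torch.no_grad()
def fp16_forward():
    layer(x)


@torch.no_grad()
def fp8_forward():
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        layer(x)


print(f"fp16: {bench(fp16_forward):.3f} ms   fp8: {bench(fp8_forward):.3f} ms")
```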
-
Hello! Thank you very much for this FP8 rowwise matmul code, it's been extremely helpful. However, there is a subtle bug/hidden requirement when, e.g., calling this code here:
https://github.com/pytor…
-
### Report of performance regression
Following the blog post [announcement](https://blog.vllm.ai/2024/07/23/llama31.html), I tried to replica…
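For context, a minimal sketch of the kind of offline throughput check such a replication might use via the `vllm` Python API; the model name, prompts, and sampling settings below are illustrative assumptions, not the exact setup (which is cut off above):
```python
# Illustrative sketch only: offline throughput measurement with vLLM for an
# FP8-quantized Llama 3.1 run; swap in the checkpoint and settings actually used.
import time

from vllm import LLM, SamplingParams

prompts = ["Summarize the history of the transformer architecture."] * 256
params = SamplingParams(temperature=0.0, max_tokens=128)

# Assumed model name; the regression report may use a different checkpoint.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", quantization="fp8")

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s")
```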
-
Could you share a rough timeline for FP8 quantization support for the Mixtral (MoE) model?
cc: @Tracin
-
**Describe the bug**
Adding `"zero_quantized_weights": true,` leads to a crash:
```
[35:1]: warnings.warn(
[35:1]:Traceback (most recent call last):
[35:1]: File "/data/env/lib/repos/retro-l…
```
-
I tried Flux training on a 2080 Ti with 22GB of VRAM, but I keep getting an error:
```
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Ex…
```
-
### Request description
The scale parameter was added to the AttentionOp/OnlineAttentionOp as a stopgap solution to make models work. Now that we are in a better place to support attention, it's time…
-
I really like the simplicity of TK and think it could be broadly applicable to kernel authoring beyond attention. Has there been any benchmarking of pure GEMM operations? If so, an example would …
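For reference, here is a minimal sketch of the kind of pure-GEMM baseline such a comparison would be measured against, timing cuBLAS through `torch.matmul` with CUDA events; the sizes, dtype, and iteration counts are illustrative, and a TK GEMM kernel would be timed the same way:
```python
# Illustrative baseline only: time a BF16 matmul (cuBLAS via torch.matmul) and
# report achieved TFLOP/s, the same metric a TK GEMM kernel would be compared on.
import torch


def bench_gemm(m: int, n: int, k: int, iters: int = 50) -> float:
    a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(10):  # warmup
        torch.matmul(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    ms_per_iter = start.elapsed_time(end) / iters
    return 2 * m * n * k / (ms_per_iter * 1e-3) / 1e12  # TFLOP/s


for size in (4096, 8192):
    print(f"{size}x{size}x{size} GEMM: {bench_gemm(size, size, size):.1f} TFLOP/s")
```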
-
Description:
I set the weight/activation type to QuantType.QFLOAT8E4M3FN when calling quantize_static, but I get the following errors:
```
Traceback (most recent call last):
  File "/home/developer/wor…
```
-
Can we make `fbgemm-gpu` an optional dependency? https://pypi.org/project/fbgemm-gpu/#files It doesn't look like it's supported on macOS (https://github.com/pytorch/FBGEMM/issues/1985). This means…
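For illustration, a minimal sketch of the kind of optional-import guard this would enable, so that code paths not needing the GPU ops still work where no `fbgemm-gpu` wheel exists (e.g. macOS); the flag and helper names below are made up for the example:
```python
# Illustrative sketch: treat fbgemm_gpu as an optional dependency by importing it
# lazily and raising a clear error only when a code path actually needs its ops.
try:
    import fbgemm_gpu  # noqa: F401  # registers the CUDA/TBE ops when available
    HAS_FBGEMM_GPU = True
except ImportError:
    HAS_FBGEMM_GPU = False


def require_fbgemm_gpu(feature: str) -> None:
    """Raise a helpful error when an fbgemm-gpu-backed feature is requested."""
    if not HAS_FBGEMM_GPU:
        raise RuntimeError(
            f"{feature} needs the optional dependency 'fbgemm-gpu', which is not "
            "installed (no wheels are published for this platform). Run it on a "
            "supported Linux/CUDA environment instead."
        )
```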