-
Does the KV cache support fp8 or int8 while the computation stays in fp16? Reading the KV cache as int8 is faster than reading it as fp16; the int8 values could then be converted to fp16 in shared memory and used for the computation.
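A minimal sketch of the idea being asked about, assuming a simple per-head symmetric int8 scheme (the function names and scale layout here are illustrative, not from any particular engine): the KV cache is stored and read as int8, then dequantized to fp16 right before the attention matmuls.

```python
import torch

def quantize_kv(kv_fp16: torch.Tensor):
    # kv_fp16: [batch, heads, seq, head_dim]; per-head symmetric int8 quantization
    scale = kv_fp16.abs().amax(dim=(-2, -1), keepdim=True).clamp(min=1e-6) / 127.0
    kv_int8 = torch.round(kv_fp16 / scale).clamp(-127, 127).to(torch.int8)
    return kv_int8, scale.to(torch.float16)

def dequantize_kv(kv_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # In a real kernel this dequantization would happen in shared memory,
    # after the int8 tiles are loaded from global memory (half the bytes of fp16).
    return kv_int8.to(torch.float16) * scale

kv = torch.randn(1, 8, 128, 64, dtype=torch.float16)
kv_q, s = quantize_kv(kv)
print((dequantize_kv(kv_q, s) - kv).abs().max())  # quantization error
```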
-
On my system, I have enough VRAM (72 GB) to run Llama-3-70B in 4-bit or 8-bit precision. However, I am unable to quantize this model to either 4-bit or 8-bit precision using the scripts in TensorRT-LL…
-
### 🐛 Describe the bug
Running torch.compile() on this Triton FP8 matmul code:
```python
from torch import Tensor

def run_gemm() -> Tensor:
    x_fp8: Tensor
    w_fp8: Tensor
    x_scale: Tensor
    …
```
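The snippet is cut off here; as a point of reference, an FP8 GEMM of this shape typically computes the dequantize-then-matmul semantics sketched below. This is a hedged reconstruction for illustration only, not the issue's actual Triton kernel.

```python
import torch

def reference_fp8_gemm(x_fp8: torch.Tensor, w_fp8: torch.Tensor,
                       x_scale: torch.Tensor, w_scale: torch.Tensor) -> torch.Tensor:
    # Reference semantics: upcast the fp8 operands, apply their scales,
    # and accumulate in higher precision. A Triton kernel fuses all of this.
    x = x_fp8.to(torch.float32) * x_scale
    w = w_fp8.to(torch.float32) * w_scale
    return (x @ w.t()).to(torch.float16)

x = torch.randn(16, 32).to(torch.float8_e4m3fn)
w = torch.randn(64, 32).to(torch.float8_e4m3fn)
out = reference_fp8_gemm(x, w, torch.tensor(1.0), torch.tensor(1.0))
```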
-
### 🐛 Describe the bug
```python
import torch
import torch._inductor.config

torch._inductor.config.force_mixed_mm = True

def f(a, b):
    return torch.mm(a, b.to(a.dtype))

fp16_act = torc…
```
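The repro is truncated at this point. A minimal sketch of how such a mixed-mm repro is usually completed, assuming an fp16 activation, an int8 weight, a CUDA device, and a PyTorch version where the `force_mixed_mm` flag still exists (shapes and variable names below are illustrative, not the original ones):

```python
import torch
import torch._inductor.config

# Assumes a PyTorch build where this inductor flag is still present.
torch._inductor.config.force_mixed_mm = True

def f(a, b):
    return torch.mm(a, b.to(a.dtype))

# Illustrative inputs: fp16 activation times int8 weight, upcast inside the matmul.
fp16_act = torch.randn(16, 32, dtype=torch.float16, device="cuda")
int8_weight = torch.randint(-128, 127, (32, 64), dtype=torch.int8, device="cuda")

compiled_f = torch.compile(f)
out = compiled_f(fp16_act, int8_weight)
```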
-
When I continue pretraining HF models with fp8, I get an error:
TypeError: ComposerHFCausalLM.__init__() got an unexpected keyword argument 'fc_type'
-
Thanks for this great project! I have some questions about how you implemented matmul for two MX-format matrices.
This repo appears to provide a simulation of it, but does not provide an actual CUDA impl…
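For context on what a simulated MX matmul typically looks like, here is a minimal sketch of fake-quantizing both operands to an MX-like format (blocks of 32 elements sharing one power-of-two scale, fp8 e4m3 element values) and multiplying in fp32. The block size, scale rule, and function names are generic illustrations, not this repo's implementation.

```python
import torch

def mx_quant_dequant(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    # Simulate MX quantization along the last dim: each block of `block` values
    # shares one power-of-two scale; elements are stored as fp8 e4m3.
    orig_shape = x.shape
    xb = x.reshape(-1, block)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    # Power-of-two shared scale so the block's largest magnitude fits in e4m3 (max 448).
    scale = torch.exp2(torch.ceil(torch.log2(amax / 448.0)))
    elems = (xb / scale).to(torch.float8_e4m3fn)                   # quantize elements
    return (elems.to(torch.float32) * scale).reshape(orig_shape)   # dequantize back

def simulated_mx_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Both operands are fake-quantized along their reduction dim, then multiplied in fp32.
    return mx_quant_dequant(a) @ mx_quant_dequant(b.t()).t()

a = torch.randn(64, 128)
b = torch.randn(128, 256)
print((simulated_mx_matmul(a, b) - a @ b).abs().max())  # simulation error vs fp32
```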
-
I'm doing an A (4-bit) x B (fp16) matmul with a large A and a small B. I expect it to beat an fp8 matmul, since it should be memory-bound.
In reality, it is consistently worse.
Example:
Kernel code is here: https…
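For context, the memory-bound expectation comes from simple byte counting: with a large A and a small B, traffic is dominated by reading A, and 4-bit weights are half the bytes of fp8. A rough back-of-the-envelope sketch under those assumptions (shapes are illustrative, not the ones from the linked kernel):

```python
# Rough traffic estimate for an M x K by K x N matmul dominated by reading A.
M, K, N = 8192, 8192, 16            # large A, small B (illustrative shapes)

bytes_a_int4 = M * K * 0.5          # 4-bit packed A
bytes_a_fp8  = M * K * 1.0          # fp8 A
bytes_b_fp16 = K * N * 2.0          # small B, roughly negligible
bytes_out    = M * N * 2.0          # fp16 output

print("int4 A traffic (MiB):", (bytes_a_int4 + bytes_b_fp16 + bytes_out) / 2**20)
print("fp8  A traffic (MiB):", (bytes_a_fp8  + bytes_b_fp16 + bytes_out) / 2**20)
# If the kernel were purely bandwidth-limited, the int4 version should take roughly
# half the time; in practice dequantization overhead and a less efficient inner loop
# can erase that advantage.
```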
-
Hello @AdnanHoque, I am trying to recreate the results from the blog [Accelerating Llama3 FP8 Inference with Triton Kernels](https://pytorch.org/blog/accelerating-llama3/). I haven't been able to get…
-
I am using a GGUF model on Aphrodite Engine. I want a context length of 8192, but I can only get it to load with about a 4096 context length; the issue is that I'm short on VRAM…
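A rough sketch of why context length ends up VRAM-limited: the KV cache grows linearly with context length, and its size depends on the cache dtype. The model shape numbers below are illustrative (roughly 7B-class), not taken from the post:

```python
# Approximate KV cache size: 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes.
layers, kv_heads, head_dim = 32, 32, 128   # illustrative 7B-class shape

def kv_cache_gib(ctx_len: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 2**30

for ctx in (4096, 8192):
    print(ctx, "fp16:", round(kv_cache_gib(ctx, 2), 2), "GiB,",
          "int8/fp8:", round(kv_cache_gib(ctx, 1), 2), "GiB")
```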
-
Add the ability to quantize to FP8. This will clearly need additional issues to be opened: flags for the C++/Python API, test cases, updates to our migraphx-driver, new kernels, an FP8 library, etc.
…
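For reference, the core numeric step behind FP8 (e4m3) quantization is small; a minimal, framework-agnostic sketch of per-tensor scaling is shown below (purely illustrative, not MIGraphX code):

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in fp8 e4m3

def quantize_fp8_e4m3(x: torch.Tensor):
    # Per-tensor scale so the largest magnitude maps to the e4m3 max, then cast.
    scale = x.abs().amax().clamp(min=1e-12) / E4M3_MAX
    x_fp8 = (x / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale  # keep the scale for dequantization / scaled matmuls

x = torch.randn(4, 4)
x_fp8, s = quantize_fp8_e4m3(x)
print((x_fp8.to(torch.float32) * s - x).abs().max())  # quantization error
```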