-
I recently ran a benchmark and recall test on the new halfvec and bit types, and both yielded impressive results.
All my tests were run against public data on https://meta.discourse.org/, a…
-
### 🐛 Describe the bug
Running torch.compile() on this Triton FP8 matmul code:
```python
def run_gemm() -> Tensor:
    x_fp8: Tensor
    w_fp8: Tensor
    x_scale: Tensor
    …
```
-
### 🐛 Describe the bug
```python
import torch
import torch._inductor.config

torch._inductor.config.force_mixed_mm = True

def f(a, b):
    return torch.mm(a, b.to(a.dtype))

fp16_act = torc…
```
-
When I continue pretraining HF models with fp8, I get this error:
TypeError: ComposerHFCausalLM.__init__() got an unexpected keyword argument 'fc_type'
-
This is an umbrella issue for allowing fp8 type(s) in shark, spanning all the required layers of the stack: Turbine, IREE, MLIR, LLVM, including backends of interest like ROCm.
Some initial researc…
-
Hi experts, I tried to use Transformer Engine to measure the FLOPS a 4090 can achieve with fp8. I used te.Linear for my evaluation and got a maximum of only 150+ TFLOPS. For fp16, the maximum is only 80…
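For reference, achieved TFLOPS can be computed from the GEMM shape and elapsed time; a minimal sketch (the shapes and timing below are hypothetical, not measured on a 4090):

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    # A GEMM of shape (m, k) x (k, n) performs 2*m*n*k floating-point ops
    # (one multiply and one add per inner-product term).
    return 2 * m * n * k / seconds / 1e12

# Hypothetical example: an 8192 x 8192 x 8192 matmul finishing in 7 ms
# works out to roughly 157 TFLOPS.
print(round(gemm_tflops(8192, 8192, 8192, 7e-3)))
```

Comparing this number against the GPU's datasheet peak for the given dtype shows how far the kernel is from the hardware limit.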
-
Thanks for this great project! I have some questions about how you implemented matmul for two MX-format matrices.
This repo appears to provide a simulation of it, but does not provide an actual CUDA impl…
-
I'm doing an A (4-bit) x B (fp16) matmul with large A and small B. I expect it to beat an fp8 matmul (it should be memory-bound).
In reality, it is always worse.
Example:
Kernel code is here: https…
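The memory-bound expectation can be sanity-checked with a quick arithmetic-intensity estimate (the shapes below are hypothetical; 4-bit A is assumed packed at 0.5 bytes/element):

```python
def arithmetic_intensity(m: int, k: int, n: int,
                         bytes_a: float, bytes_b: float, bytes_out: float) -> float:
    # FLOPs for an (m, k) x (k, n) GEMM, divided by total bytes moved
    # (read A, read B, write the output once; caches ignored).
    flops = 2 * m * k * n
    bytes_moved = m * k * bytes_a + k * n * bytes_b + m * n * bytes_out
    return flops / bytes_moved

# Large 4-bit A (0.5 B/elem) x small fp16 B (2 B/elem), fp16 output.
ai_w4 = arithmetic_intensity(8192, 8192, 16, 0.5, 2, 2)
# fp8 x fp8 at the same shape, fp16 output.
ai_fp8 = arithmetic_intensity(8192, 8192, 16, 1, 1, 2)

# With large A dominating traffic, the 4-bit case moves roughly half the
# bytes of fp8, so a memory-bound kernel should run roughly 2x faster.
```

If the measured kernel is instead slower, the bottleneck is likely elsewhere (e.g. dequantization overhead), not DRAM bandwidth.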
-
Hello @AdnanHoque , I am trying to recreate the results from the blog [Accelerating Llama3 FP8 Inference with Triton Kernels](https://pytorch.org/blog/accelerating-llama3/). I haven't been able to get…
mgoin updated
2 months ago
-
I am using a GGUF model on Aphrodite Engine. I want a context length of 8192, but I can only get it to load with about 4096 context; the issue is that I'm short on VRAM...…
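For context, the KV cache is usually what grows with context length; a rough size estimate (the model dimensions below are hypothetical, not taken from this model):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    # Keys and values each store n_layers * n_kv_heads * head_dim
    # elements per token, hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical 7B-class model: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
gib = kv_cache_bytes(32, 32, 128, 8192) / 2**30  # 4.0 GiB at 8192 tokens
```

Doubling the context from 4096 to 8192 doubles this footprint, which is why the larger setting can fail to load when VRAM is already tight.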