-
With an NF4 model at 1024 × 1024 resolution, a 10-series or 20-series 8 GB graphics card takes about four minutes to generate a single image.
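NF4 stores weights as 4-bit codes plus a per-block scale, which is how large models fit on 8 GB cards at the cost of extra dequantization work per layer. A minimal pure-Python sketch of the general blockwise absmax 4-bit scheme (uniform levels for illustration — NOT the actual NF4 codebook):

```python
# Illustrative blockwise 4-bit absmax quantization (uniform grid, not the
# real NF4 code values): each block stores one float scale + 4-bit indices.

def quantize_block(block):
    """Quantize one block of floats to signed 4-bit levels in [-7, 7]."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 7.0
    return scale, [round(x / scale) for x in block]

def dequantize_block(scale, codes):
    return [c * scale for c in codes]

block = [0.02, -0.7, 0.35, 0.0]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
# round-trip error is bounded by half a quantization step (scale / 2)
assert all(abs(a - b) <= scale / 2 + 1e-12 for a, b in zip(block, restored))
```

The real NF4 codebook replaces the uniform grid with 16 levels tuned for normally distributed weights, but the storage layout (codes + per-block scale) is the same idea.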
-
Hi!
I'm getting the following error when trying to use Transformer Engine: 100 errors detected in the compilation of "transformer_engine/common/transpose/rtc/cast_transpose.cu".
Compilation terminat…
-
-
### 🐛 Describe the bug
```python
import torch
import torch._inductor.config
torch._inductor.config.force_mixed_mm = True
def f(a, b):
    return torch.mm(a, b.to(a.dtype))

fp16_act = torc…
```
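What `force_mixed_mm` targets is, conceptually, a matmul whose second operand is stored in a narrower dtype and upcast on the fly instead of being materialized first. A pure-Python sketch of that dequantize-then-multiply pattern (the function name and the single per-tensor scale are illustrative, not the inductor's actual lowering):

```python
# Illustrative mixed-dtype matmul: `a` is float, `w_int8` holds quantized
# weights, `scale` is a hypothetical per-tensor dequantization factor.

def dequant_mm(a, w_int8, scale):
    """a: float matrix (list of rows), w_int8: int matrix, scale: float."""
    k, n = len(w_int8), len(w_int8[0])
    return [
        [sum(row[i] * w_int8[i][j] * scale for i in range(k)) for j in range(n)]
        for row in a
    ]

a = [[1.0, 2.0]]
w = [[10, -20], [30, 40]]
print(dequant_mm(a, w, 0.1))  # → [[7.0, 6.0]]
```

The fused kernel avoids writing the dequantized copy of `w` to memory, which is where the speedup at small batch sizes comes from.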
-
### 🚀 The feature, motivation and pitch
This is moving "issue 2" from https://github.com/pytorch/pytorch/issues/130015 to be tracked separately.
**Context**:
While using…
vkuzo updated 1 month ago
-
Hello, we have measured FP8 GEMM performance using Triton on an NVIDIA H100 (500 W, 1980 MHz). We would appreciate your help in understanding whether this performance is expected.
Since H100 FP8 o…
sryap updated 3 months ago
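A back-of-envelope way to sanity-check such measurements is to convert a timing into achieved TFLOP/s and compare against the card's peak. A sketch with hypothetical numbers — the peak figure below is an assumption for illustration, and should be adjusted for the actual clocks and power cap:

```python
# Hedged sketch: GEMM efficiency from a wall-clock timing.

def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOP/s for an M x N x K GEMM (2*M*N*K FLOPs)."""
    return 2.0 * m * n * k / seconds / 1e12

# Hypothetical measurement: an 8192^3 FP8 GEMM timed at 1.0 ms.
achieved = gemm_tflops(8192, 8192, 8192, 1.0e-3)
assumed_peak = 1979.0  # ASSUMED dense-FP8 peak in TFLOP/s; not a measured spec
print(f"{achieved:.1f} TFLOP/s, {achieved / assumed_peak:.0%} of assumed peak")
```

Note that a 500 W power limit will throttle sustained clocks below the boost figure, so the practically reachable peak is lower than the datasheet number.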
-
I want to set TP size = 2 with a global world size = 2.
The code is:
```python
import os
import sys
import subprocess
import argparse
import torch
import torch.distributed as dist
import…
```
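For that configuration, the rank-to-group bookkeeping is simple: with `tp_size == world_size == 2`, all ranks belong to a single tensor-parallel group. A small sketch of the mapping (the helper name is hypothetical, not code from the issue):

```python
# Illustrative rank partitioning: consecutive global ranks form one
# tensor-parallel group.

def tp_groups(world_size, tp_size):
    """Partition global ranks [0, world_size) into TP groups of size tp_size."""
    assert world_size % tp_size == 0, "world size must be divisible by tp size"
    return [list(range(start, start + tp_size))
            for start in range(0, world_size, tp_size)]

print(tp_groups(2, 2))  # → [[0, 1]]: one TP group spanning both ranks
print(tp_groups(8, 2))  # → [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each inner list is the kind of rank list you would typically pass to `torch.distributed.new_group(ranks=...)` when building the TP communicator.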
-
If someone is motivated, FP8 could be supported (on some hardware) by adapting to this new library:
https://github.com/NVIDIA/TransformerEngine
cc: @guillaumekln @francoisher…
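For context on what FP8 support entails: FP8 recipes track a per-tensor scale so that the largest magnitude lands near the E4M3 format's maximum of 448. A minimal sketch of that scaling step (illustrative only — not Transformer Engine's actual API):

```python
# Hedged sketch of per-tensor FP8 scaling. The helper name is hypothetical.

E4M3_MAX = 448.0  # largest finite value representable in fp8 E4M3

def fp8_scale(values):
    """Multiplicative scale so the largest |value| maps to E4M3_MAX."""
    amax = max(abs(x) for x in values)
    return E4M3_MAX / amax if amax > 0 else 1.0

xs = [0.5, -2.0, 1.25]
s = fp8_scale(xs)               # 448 / 2 = 224
scaled = [x * s for x in xs]    # now spans the full E4M3 range
print(s, max(abs(v) for v in scaled))  # → 224.0 448.0
```

In practice libraries like Transformer Engine amortize this by tracking a running amax history rather than recomputing the scale from each tensor.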
-
### Feature Idea
Saw the claim in this Reddit thread; hopefully the ideas there can also be brought into Comfy for even more speedups.
https://www.reddit.com/r/StableDiffusion/comments/1ex64jj/i_m…
-
Was chatting with @Chillee about our plans in AO today and he mentioned we should be focusing on a few concrete problems like
1. Demonstrate compelling perf for fp8 gemm at a variety of batch sizes.
…
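One way to frame "compelling perf at a variety of batch sizes" is arithmetic intensity: small-M GEMMs are memory-bound, where fp8's bandwidth savings dominate, while large-M GEMMs are compute-bound, where the fp8 tensor-core rate dominates. A hedged sketch of the calculation (byte counts assume 1-byte fp8 inputs and a 2-byte fp16 output; shapes are illustrative):

```python
# Illustrative arithmetic intensity (FLOPs per byte) for an M x K @ K x N GEMM.

def arithmetic_intensity(m, n, k):
    flops = 2 * m * n * k
    bytes_moved = m * k * 1 + k * n * 1 + m * n * 2  # A, B in fp8; C in fp16
    return flops / bytes_moved

for m in (1, 16, 256, 4096):
    print(m, round(arithmetic_intensity(m, 4096, 4096), 1))
```

At M = 1 the intensity is about 2 FLOPs/byte (pure weight streaming), so the relevant metric there is effective bandwidth, not TFLOP/s.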