-
I just want to share some anecdotal information, from a user perspective, in case you have issues with some LoRAs.
I had issues running this large LoRA (1.28 GB) from Civitai with Flux FP8.
http…
-
The original version works without problems: https://github.com/balazik/ComfyUI-PuLID-Flux.git
-
Since Ada GPUs like the 4090 restrict FP8 arithmetic to `fp32` accumulation, they only achieve the same peak `TFLOPs` as `fp16xfp16` with `fp16` accumulation.
Furthermore, according to my tests,…
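For what it's worth, the comparison I have in mind looks roughly like the micro-benchmark below. It uses PyTorch's private `torch._scaled_mm` for the FP8 path; the call signature follows the 2.4-era API and may differ in other releases, and the shapes, trivial scales, and timing helper are only illustrative.

```python
import torch

def bench(fn, iters=50):
    # Simple CUDA-event timer; warm up first so one-time costs are excluded.
    for _ in range(5):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

M = N = K = 8192
a16 = torch.randn(M, K, device="cuda", dtype=torch.float16)
b16 = torch.randn(K, N, device="cuda", dtype=torch.float16)

# FP8 operands: _scaled_mm wants a row-major first operand and a column-major second one.
a8 = a16.to(torch.float8_e4m3fn)
b8 = b16.t().contiguous().to(torch.float8_e4m3fn).t()
one = torch.tensor(1.0, device="cuda")  # trivial per-tensor scales

t_fp16 = bench(lambda: a16 @ b16)
t_fp8 = bench(lambda: torch._scaled_mm(a8, b8, scale_a=one, scale_b=one,
                                       out_dtype=torch.float16))
t_fp8_fast = bench(lambda: torch._scaled_mm(a8, b8, scale_a=one, scale_b=one,
                                            out_dtype=torch.float16, use_fast_accum=True))
print(f"fp16: {t_fp16:.2f} ms | fp8 (fp32 acc): {t_fp8:.2f} ms | fp8 (fast acc): {t_fp8_fast:.2f} ms")
```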
-
Hello, it looks like EmbeddingBagCollection forces the data type to be float32 or float16 during initialization.
https://github.com/pytorch/torchrec/blob/main/torchrec/modules/embedding_modules.py#L179
…
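A minimal sketch of what I am running into (the top-level imports and constructor arguments follow my reading of the torchrec docs, so treat them as assumptions):

```python
import torch
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig

# One small table; I would like its weights in a dtype other than float32/float16,
# but the dtype chosen at the linked initialization code is what I actually get.
tables = [
    EmbeddingBagConfig(
        name="t1",
        embedding_dim=64,
        num_embeddings=1000,
        feature_names=["f1"],
    )
]
ebc = EmbeddingBagCollection(tables=tables, device=torch.device("cpu"))

for name, param in ebc.named_parameters():
    print(name, param.dtype)  # float32 here; float16 is the only other option
```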
-
### System Info
CPU: x86_64
GPU: NVIDIA L20
TensorRT branch: v0.13.0
CUDA: 12.5 (Driver Version: 535.161.07)
### Who can help?
@kaiyux @byshiue
### Information…
-
`Flux diffusion model implementation using quantized fp8 matmul & remaining layers use faster half precision accumulate, which is ~2x faster on consumer devices.`
Hello there!
Thanks for sharing you…
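For readers who, like me, were curious what the quoted approach looks like in practice, here is a rough sketch of the idea as I understand it, not the repo's actual code: fp8-quantized linear layers via PyTorch's private `torch._scaled_mm`, with reduced-precision accumulation enabled for the remaining fp16 GEMMs. All names, scales, and shapes below are illustrative.

```python
import torch
import torch.nn as nn

# Let the remaining (non-quantized) fp16 GEMMs use reduced-precision accumulation
# where the backend supports it: the "faster half precision accumulate" part.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True

class FP8Linear(nn.Module):
    """Illustrative drop-in for nn.Linear with the weight stored in fp8."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Trivial per-tensor scales; a real implementation would calibrate these.
        self.register_buffer("scale", torch.tensor(1.0))
        self.register_buffer("weight", linear.weight.detach().to(torch.float8_e4m3fn))
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # _scaled_mm wants a row-major activation and a column-major weight,
        # and shapes divisible by 16; 2-D activations are assumed here.
        y = torch._scaled_mm(
            x.to(torch.float8_e4m3fn), self.weight.t(),
            scale_a=self.scale, scale_b=self.scale, out_dtype=torch.float16,
        )
        return y if self.bias is None else y + self.bias

layer = FP8Linear(nn.Linear(4096, 4096, dtype=torch.float16)).cuda()
print(layer(torch.randn(16, 4096, device="cuda", dtype=torch.float16)).shape)
```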
-
I would like to ask about the exact speed benchmarking configuration used in the paper, as it is not mentioned. I tested the kernel on an RTX 4090 with PyTorch 2.4 (cu118) and the corresponding Triton version. The resul…
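For concreteness, the harness below is roughly what I use on my side; the shapes are illustrative, and the `a16 @ b16` call just stands in for the kernel under test.

```python
import torch
import triton

M = N = K = 4096  # illustrative shapes; the paper's shapes are what I am asking about

a16 = torch.randn(M, K, device="cuda", dtype=torch.float16)
b16 = torch.randn(K, N, device="cuda", dtype=torch.float16)

# Replace `a16 @ b16` with the kernel under test; do_bench handles warm-up and timing.
ms = triton.testing.do_bench(lambda: a16 @ b16, warmup=25, rep=100)
tflops = 2 * M * N * K / (ms * 1e-3) / 1e12
print(f"{ms:.3f} ms  ->  {tflops:.1f} TFLOPS")
```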
-
### Anything you want to discuss about vllm.
I am trying to run a serving performance test using pipeline parallelism with the Llama 3.1 405B model and an 8B draft model, but the model fails t…
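For reference, the configuration I am aiming at is roughly the following. It is written with the offline `LLM` entry point purely to show the engine arguments (the actual test goes through the OpenAI-compatible server), and the argument names, parallel sizes, and model IDs are my reading of the docs for this version, so treat them as assumptions.

```python
from vllm import LLM, SamplingParams

# Target model split across pipeline stages, plus a small draft model for
# speculative decoding; this is the combination that fails for me.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=5,
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```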
-
I'm doing a matmul with A in 4-bit and B in fp16, with a large A and a small B. I expect it to beat an fp8 matmul (it should be memory-bound).
In reality, it always seems to be slower.
Example:
Kernel code is here: https…
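To make the memory-bound expectation concrete, this is the back-of-the-envelope byte count I have in mind (pure arithmetic, illustrative shapes only):

```python
# Bytes that have to move for C = A @ B with a large A (M x K) and a small B (K x N).
M, K, N = 65536, 4096, 64  # illustrative: A dominates the traffic

def traffic_bytes(bytes_a, bytes_b, bytes_c=2):  # fp16 output
    return M * K * bytes_a + K * N * bytes_b + M * N * bytes_c

int4_fp16 = traffic_bytes(0.5, 2)  # A in 4 bit, B in fp16
fp8_fp8 = traffic_bytes(1, 1)      # both operands in fp8

# With A dominating, the int4 kernel reads roughly half the bytes of the fp8 one,
# so if both are memory-bound it should be faster, not slower.
print(f"{int4_fp16 / 1e6:.1f} MB vs {fp8_fp8 / 1e6:.1f} MB")
```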
-
Hello @mgoin, it's a pleasant surprise to discover this project. Thank you for your contributions to BitBLAS. We have recently added support for FP8 Matmul, hoping it will help this project.
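For anyone curious, a rough sketch of how the FP8 path can be driven, following the quick-start pattern; the exact dtype strings and whether `transform_weight` is required here are my reading of the BitBLAS docs, so treat the details as assumptions rather than the definitive API.

```python
import torch
import bitblas

# FP8 x FP8 -> FP16 GEMM through BitBLAS, with fp32 accumulation.
config = bitblas.MatmulConfig(
    M=16, N=1024, K=1024,
    A_dtype="e4m3_float8",
    W_dtype="e4m3_float8",
    accum_dtype="float32",
    out_dtype="float16",
    layout="nt",  # activation row-major, weight stored as (N, K)
)
matmul = bitblas.Matmul(config=config)

a = torch.randn(16, 1024, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)
w = torch.randn(1024, 1024, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)
out = matmul(a, matmul.transform_weight(w))
print(out.shape, out.dtype)
```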