bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx #1363

Open LukeLIN-web opened 4 days ago

LukeLIN-web commented 4 days ago

System Info

I am using CUDA 12.2, torch 2.1.0a0+29c30b1, bitsandbytes 0.43.3, and Python 3.10, with driver version 535.113.01 on an NVIDIA GeForce RTX 2080 Ti.

Reproduction

import gc

import torch
from diffusers import LattePipeline
from transformers import T5EncoderModel, BitsAndBytesConfig
import imageio
from torchvision.utils import save_image

torch.manual_seed(0)

def flush():
    # Release Python-level references and return cached GPU memory to the allocator.
    gc.collect()
    torch.cuda.empty_cache()

def bytes_to_giga_bytes(num_bytes):
    # Convert a byte count to gibibytes for the memory report below.
    return num_bytes / 1024 / 1024 / 1024

video_length = 16
model_id = "maxin-cn/Latte-1"

# Load only the T5 text encoder, quantized to 4-bit with bitsandbytes, to compute prompt embeddings.
text_encoder = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
    cache_dir="/data/"
)

# Build a pipeline around the quantized text encoder (transformer omitted) just for prompt encoding.
pipe = LattePipeline.from_pretrained(
    model_id, 
    text_encoder=text_encoder,
    transformer=None,
    device_map="balanced",
    cache_dir="/data/"
)

with torch.no_grad():
    prompt = "a cat wearing sunglasses and working as a lifeguard at pool."
    negative_prompt = ""
    prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(prompt, negative_prompt=negative_prompt)

# Drop the quantized text encoder and the encoding pipeline, then release GPU memory.
del text_encoder
del pipe
flush()

# Reload the pipeline without the text encoder; the transformer and VAE run in fp16.
pipe = LattePipeline.from_pretrained(
    model_id,
    text_encoder=None,
    torch_dtype=torch.float16,
    cache_dir="/data/",
).to("cuda")
# pipe.enable_vae_tiling()
# pipe.enable_vae_slicing()

videos = pipe(
    video_length=video_length,
    num_inference_steps=50,
    negative_prompt=None, 
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    output_type="pt",
).frames.cpu()

print(f"Max memory allocated: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated())} GB")

if video_length > 1:
    videos = (videos.clamp(0, 1) * 255).to(dtype=torch.uint8) # convert to uint8
    imageio.mimwrite('./latte_output.mp4', videos[0].permute(0, 2, 3, 1), fps=8, quality=5) # highest quality is 10, lowest is 0
else:
    save_image(videos[0], './latte_output.png')

https://github.com/Vchitect/Latte/issues/125#issue-2529714919

Expected behavior

Per https://huggingface.co/docs/bitsandbytes/v0.43.3/installation, what is the GPU requirement for 4-bit quantization?
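
For reference, a quick way to check the GPU name and compute capability from the same environment (a small sketch, added for context):

import torch

# Print the device name and compute capability reported by the CUDA runtime.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # the RTX 2080 Ti reports (7, 5)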

matthewdouglas commented 3 days ago

Hi @LukeLIN-web, I was not able to reproduce this on an RTX 4090. That said, I would also expect it to work on a 2080 Ti, as that GPU is fully supported for 4bit quantization with bitsandbytes.
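
To help isolate bitsandbytes from the diffusers/transformers stack, a minimal 4-bit sanity check along these lines could be useful (a rough sketch; the layer sizes are arbitrary):

import torch
import bitsandbytes as bnb

# A small 4-bit linear layer; the weights are quantized when the module is moved to the GPU.
layer = bnb.nn.Linear4bit(64, 64, compute_dtype=torch.float16).to("cuda")

x = torch.randn(8, 64, dtype=torch.float16, device="cuda")
with torch.no_grad():
    y = layer(x)  # exercises the 4-bit matmul path
print(y.shape, y.dtype)

If this runs cleanly on the 2080 Ti, the failure is more likely coming from another part of the pipeline.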

I suspect your stack trace is not giving the full picture, as we do not use cublasGemmEx in the 4-bit path; the call likely comes from a PyTorch operation. You may get a clearer trace by setting CUDA_LAUNCH_BLOCKING=1 in your environment.
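
For example, it can be set at the top of the script before torch touches the GPU (setting it in the shell before launching the script works just as well):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # set before any CUDA work so kernel launches become synchronous

import torch  # import torch and run the rest of the script only after the variable is in place

With launches forced to be synchronous, the reported stack trace should point at the operation that actually failed.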