compile_vae causes a segmentation fault at destruction.

Thanks for your awesome work!

I have recently been trying to use stable-fast to accelerate VAE computation and ran into a segmentation fault at program exit. I have put together a minimal reproduction and hope it helps stable-fast become more stable :)

Minimal reproduction:

import torch
import torch.nn.functional as F

from diffusers import AutoencoderKL

from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile_vae,
    CompilationConfig,
)

device = torch.device("cuda:0")

# Load the fp16 VAE from the Stable Diffusion 2.1 checkpoint.
SD_2_1_DIFFUSERS_MODEL = "stabilityai/stable-diffusion-2-1"
variant = {"variant": "fp16"}
vae_orig = AutoencoderKL.from_pretrained(
    SD_2_1_DIFFUSERS_MODEL,
    subfolder="vae",
    torch_dtype=torch.float16,
    **variant,
)

vae_orig.to(device)

# Compile with Triton enabled; xformers and CUDA graphs are disabled.
sfast_config = CompilationConfig.Default()
sfast_config.enable_xformers = False
sfast_config.enable_triton = True
sfast_config.enable_cuda_graph = False
vae = compile_vae(vae_orig, sfast_config)

sample_imgs = torch.randn(4, 3, 128, 128, dtype=vae.dtype, device=device)
latents1 = torch.randn(4, 4, 16, 16, dtype=vae.dtype, device=device)

# Encode once through the compiled VAE.
latents = vae.encode(sample_imgs).latent_dist.sample()

# Encode again through vae_orig, this time with an input that requires grad.
sample_imgs_dup = sample_imgs.clone().detach().requires_grad_(True)
latents2 = vae_orig.encode(sample_imgs_dup).latent_dist.sample()
# The segmentation fault occurs after this point, at program exit.
print("Test done")

gdb points to this line in the core dump, but I'm not sure of the cause. The pointer is not null, but it definitely refers to memory that is not accessible.
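One way to narrow down which object's teardown triggers the fault would be to release the objects explicitly, one at a time, before the interpreter exits. The sketch below uses only the Python standard library. `install_crash_tracer` and `release_in_order` are hypothetical helper names, not part of stable-fast or diffusers, and this has not been verified against this particular crash.

```python
import faulthandler
import gc
import sys


def install_crash_tracer():
    """Dump the Python traceback of all threads if the process gets SIGSEGV."""
    faulthandler.enable(all_threads=True)


def release_in_order(namespace, names):
    """Delete the named objects one by one, running a GC pass after each.

    If a destructor crashes, the last "releasing ..." line on stderr names
    the object whose teardown was in progress. Names that are not present
    in the namespace are skipped. Returns the names actually released.
    """
    released = []
    for name in names:
        if name in namespace:
            print(f"releasing {name}", file=sys.stderr, flush=True)
            del namespace[name]
            gc.collect()
            released.append(name)
    return released
```

Assumed usage at the end of the reproduction script (module level, so that `globals()` is the real namespace): call `install_crash_tracer()` at the top, then `release_in_order(globals(), ["latents", "latents2", "sample_imgs_dup", "vae", "vae_orig"])` after `print("Test done")`. Reordering the list may show whether the crash depends on which object is destroyed first.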

chengzeyi / stable-fast, issue #118