facebookincubator / AITemplate

AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
Apache License 2.0

stable_diffusion alt example generates bad images on T4 GPU (sm 7.5) #781

Open apivovarov opened 1 year ago

apivovarov commented 1 year ago

I tried to use compile_alt and demo_alt to compile and run the stable_diffusion model on a T4 GPU (sm 7.5). The generated image is a mess - Example

Commands I used:

python3 scripts/download_pipeline.py

python3 scripts/compile_alt.py \
--width 512 512 --height 512 512 \
--include-constants True \
--local-dir tmp/diffusers-pipeline/stabilityai/stable-diffusion-v2

python3 scripts/demo_alt.py \
--hf-hub-or-path tmp/diffusers-pipeline/stabilityai/stable-diffusion-v2 \
--prompt "a photo of an astronaut riding a horse on mars" \
--batch 1

To reproduce this issue on an A100 GPU, edit aitemplate/testing/detect_target.py and return "75" in _detect_cuda():

def _detect_cuda():
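    # hard-coded override for this repro: always report sm75 regardless of the GPU actually present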
    if True:
        return "75"

Related issue - https://github.com/facebookincubator/AITemplate/issues/758

@terrychenism @hlky

ipiszy commented 1 year ago

Some AIT kernels (e.g. the mem-efficient-attention kernel) may lack SM75 specializations; SM80+ is needed. (Also check https://github.com/facebookincubator/AITemplate#installation.)

apivovarov commented 1 year ago

Ying, I was able to compile and run SD 2.1 on SM75 using the regular (non-alt) compile/demo scripts, and the images are good (related PR https://github.com/facebookincubator/AITemplate/pull/765). Ironically, the fix was to use a variable batch (1, 8) in compile_clip.py instead of the particular single value provided by the user (e.g. 1).

The alt scripts add support for dynamic batch sizes; the default batch shape in compile_alt.py is (1, 4).
Something is not working correctly when we use dynamic batches in these 3 models of the SD pipeline.
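
For reference, this is roughly how a dynamic batch dimension is expressed when the AIT graph is built (a minimal sketch, assuming the IntVar/Tensor frontend API; the exact tensor names, dtype and sequence length used in compile_clip.py may differ):

from aitemplate.frontend import IntVar, Tensor

# Batch dimension declared as a (min, max) range instead of a single fixed value
batch = IntVar(values=[1, 8], name="batch_size")
seqlen = 77  # CLIP context length

# CLIP text-encoder input whose leading dimension can vary between 1 and 8 at runtime
input_ids = Tensor([batch, seqlen], name="input0", dtype="int64", is_input=True)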

hlky commented 1 year ago

--width 512 512 --height 512 512 - Not sure why you are doing this; the purpose of the alt scripts is to support dynamic shapes.

stabilityai/stable-diffusion-v2 doesn't exist, I assume you mean stabilityai/stable-diffusion-2.

on A100 GPU edit aitemplate/testing/detect_target.py and return "75" in _detect_cuda() - This will not help; by doing this you are only setting the CUDA gencode level, whereas any issue with SM75 would only be noticeable when profiling runs on an SM75 device.

I have tested using T4 from GCP, and cannot reproduce this.

Using the ComfyUI plugin and pre-compiled SM75 modules. Specifically I used linux/sm75/bs1/1024/v2_unet_64_1024 and linux/sm75/bs1/1024/v2_vae_64_1024; these modules support dynamic shapes from 64 to 1024, and for the UNet, batch size 1 is technically the range 1-2 to support classifier-free guidance scale.

stabilityai/stable-diffusion-2-1-base: (screenshot msedge_5J9Aa88xSl)
stabilityai/stable-diffusion-2: (screenshot msedge_H32uadlMHk)

Using python3 scripts/demo_alt.py --hf-hub-or-path stabilityai/stable-diffusion-2 --prompt "a photo of an astronaut riding a horse on mars" --batch 1: (output image example_ait)

I would suggest checking the environment you are compiling modules with (Python version, CUDA version, PyTorch version, etc.). For reference, the environment used for the pre-compiled SM75 modules is:

GCP Ubuntu 20.04
Python 3.8.10
PyTorch 2.0.1+cu117
nvcc 11.6
Driver Version: 510.47.03
CUDA Version: 11.6
Tesla T4
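
If it helps, here is a quick way to dump the equivalent information from your own environment (a small sketch using standard PyTorch and CLI calls, nothing AIT-specific):

import subprocess
import sys

import torch

print("Python :", sys.version.split()[0])
print("PyTorch:", torch.__version__, "| torch CUDA:", torch.version.cuda)
print("GPU    :", torch.cuda.get_device_name(0))
nvcc_out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
print("nvcc   :", next(line for line in nvcc_out.splitlines() if "release" in line))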

I would also suggest using the dynamic shape options correctly: --width 64 1024 --height 64 1024 is a much better choice than 512 512. In general, if you are not using dynamic shapes you will need dozens of modules to support a variety of resolutions. The same applies to --include-constants True: there is no need to have a distinct module for every model. As long as the model architecture matches the module architecture, the weights can be applied at runtime (see the sketch below), which means any SD2.x model will load into an SD2.x module regardless of whether the model is a base or 768 version.
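
For context, applying weights at runtime looks roughly like the following (a sketch only; the module path is hypothetical, and building state_dict with the correct AIT constant names from the HF checkpoint is what the demo scripts / plugin handle):

from aitemplate.compiler import Model

# Load a pre-compiled AIT module (hypothetical path)
module = Model("tmp/v2_unet_64_1024/test.so")

# Mapping of AIT constant names to fp16 CUDA torch.Tensors, built from the HF checkpoint elsewhere
state_dict = {}
module.set_many_constants_with_tensors(state_dict)
module.fold_constants(sync=True)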

apivovarov commented 1 year ago

Google Colab provides free notebooks with a T4 GPU runtime.

I prepared the notebook AIT_alt_bad_image.ipynb to demonstrate the bad image generation on a T4 GPU.

Model: stabilityai/stable-diffusion-2-1-base - 512x512

(screenshot: bad image)