apple / ml-stable-diffusion

Stable Diffusion with Core ML on Apple Silicon
MIT License

mixed_bit_compression_pre_analysis: Weird recipe results #293

Open MenKuch opened 11 months ago

MenKuch commented 11 months ago

I want to convert a custom sd-1.5-derived model to sub-6-bit-weights using mixed_bit_compression_pre_analysis and mixed_bit_compression_apply – but the analysis is not working correctly for me. As I read in issue #270 (https://github.com/apple/ml-stable-diffusion/issues/270), I’ve changed TEST_RESOLUTION = 512 before running analysis.
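For anyone following along, the change referenced from issue #270 is a one-line edit to a module-level constant in `mixed_bit_compression_pre_analysis.py` (512 matches SD 1.5's native training resolution):

```python
# In python_coreml_stable_diffusion/mixed_bit_compression_pre_analysis.py,
# near the top of the file: run the analysis at SD 1.5's native 512x512
# resolution instead of the default.
TEST_RESOLUTION = 512
```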

I’ve tested the following environments using the non-modified stable-diffusion-1.5-model on hugging face to rule out any issues with my model:

Python 3.8.18 + torch 2.0.0 + macOS 14 23A344

Command: python -m python_coreml_stable_diffusion.mixed_bit_compression_pre_analysis --model-version runwayml/stable-diffusion-v1-5 -o ./

Result: Error: input types 'tensor<1x616x1xf16>' and 'tensor<1xf32>' are not broadcast compatible
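The shapes in that error suggest a dtype mismatch rather than a shape mismatch: unlike PyTorch, Core ML's MIL does not auto-promote mixed float16/float32 operands. A minimal sketch of the failing pattern and the usual fix (an explicit cast to a common dtype), using hypothetical tensors with the reported shapes:

```python
import torch

# Tensors with the shapes and dtypes named in the error message.
a = torch.ones(1, 616, 1, dtype=torch.float16)
b = torch.ones(1, dtype=torch.float32)

# PyTorch silently promotes to float32 here, but the same op in a graph
# converted to Core ML fails, because MIL requires both operands of a
# binary op to share a dtype. The usual fix is an explicit cast:
c = a * b.to(a.dtype)
assert c.dtype == torch.float16 and c.shape == (1, 616, 1)
```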

Python 3.10 + torch 2.1 + macOS 14 23A344

Command: python -m python_coreml_stable_diffusion.mixed_bit_compression_pre_analysis --model-version runwayml/stable-diffusion-v1-5 -o ./

(Same result as next test)

Python 3.10 + torch 2.2.0 nightly (from Fri, Oct 27 2023) + macOS 14 23A344

Command: python -m python_coreml_stable_diffusion.mixed_bit_compression_pre_analysis --model-version runwayml/stable-diffusion-v1-5 -o ./

Results: After 3-4 hours on my M1 Max, the analysis finished and I found the recipe.json file as well as a PNG graph. But I think something went wrong calculating the recipes, because I only found an 8.21-bit and a 16.00-bit mixedpalette recipe:

  "model_version": "runwayml/stable-diffusion-v1-5",
  "baselines": {
    "original": 211.625,
    "linear_8bit": 80.125,
    "recipe_16.00_bit_mixedpalette": 211.6,
    "recipe_8.21_bit_mixedpalette": 82.6
  },

Also, the generated graph looks strange, with a far too high signal integrity (PSNR):

[Graph: runwayml_stable-diffusion-v1-5_psnr_vs_size]
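For context, the "signal integrity" axis on that graph is PSNR in dB, which diverges to infinity as the compressed output approaches the reference, so a near-identical pair of signals produces suspiciously high values like the ~211 "original" baseline above. A minimal sketch of the metric (not the repository's exact implementation):

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    # Peak signal-to-noise ratio in dB; higher means the test signal is
    # closer to the reference. Identical inputs give infinity, which is
    # why a "pristine" self-PSNR should be a large but finite number
    # only when some noise source (e.g. sampling) is present.
    mse = np.mean((np.asarray(reference, dtype=np.float64)
                   - np.asarray(test, dtype=np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak * peak / mse)

assert psnr([0.0], [0.0]) == float("inf")
assert abs(psnr([0.0], [255.0])) < 1e-9  # error equal to peak -> 0 dB
```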

Additionally, in the final step of the analysis, the following error is reported:

Pipelines loaded with `torch_dtype=torch.float16` cannot run with `cpu` device. It is not recommended to move them to `cpu` as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support for `float16` operations on this device in PyTorch. Please, remove the `torch_dtype=torch.float16` argument, or use another device for inference.
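That warning comes from diffusers: a pipeline loaded with `torch_dtype=torch.float16` cannot execute on CPU. A hedged sketch of the usual workaround, as a hypothetical helper that picks half precision only when an accelerator is available and float32 otherwise:

```python
import torch

def pick_dtype(device: str) -> torch.dtype:
    # Hypothetical helper: float16 kernels are largely unsupported on CPU
    # in PyTorch, so fall back to float32 there; use half precision on
    # CUDA/MPS accelerators.
    return torch.float16 if device in ("cuda", "mps") else torch.float32

# The chosen dtype would then be passed as the torch_dtype= argument
# when loading the diffusers pipeline.
assert pick_dtype("cpu") is torch.float32
assert pick_dtype("cuda") is torch.float16
```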

Any insights on what I am doing wrong?

atiorh commented 11 months ago

Thanks for the report @MenKuch! I ran into this in the past, where the pristine self-PSNR was ~220 (unexpected) instead of ~90 (expected), but it automagically went away. The recipes we generated and published on the Hugging Face Hub were done on V100 CUDA devices and should be useful for you. As a short-term workaround, I recommend reusing the v1-5 recipe we published, even though your model is a fine-tune of it.

MenKuch commented 11 months ago

Hi! Thanks for your response.

Could you share some information about the environment you used on the V100 CUDA devices? I tried NVIDIA T4 GPUs on Windows (via AWS) with torch 2.2.0 nightly and got the same results as on my M1 Max.

Background: My models are based on dreamshaper v8 (https://civitai.com/models/4384/dreamshaper) which is based on sd1.5 – I altered dreamshaper v8 with custom trained LoRAs for the effect I want to achieve and it turned out exactly like I wanted.

But: Dreamshaper v8 seems to have drifted too far from sd1.5 for the Apple-provided sd1.5 recipes, since a reduction to 4.x bits degraded my models far too much. So the only way forward for me is to generate my own analysis/recipes. That's why I am interested in the environment (hardware, platform, Python version, torch version, etc.) you used for the analysis.

MenKuch commented 11 months ago

I jumped the gun and used an NVIDIA V100 AWS Linux instance (with torch 2.0.1) to check whether mixed_bit_compression_pre_analysis works there – and it does (YAY!).

But sadly, a new issue arose: I did not get any recipes in the low-5 or high-4 bit range, just slightly below 6:

  "model_version": "runwayml/stable-diffusion-v1-5",
  "baselines": {
    "original": 88.41249999999998,
    "linear_8bit": 80.2,
    "recipe_8.84_bit_mixedpalette": 86.4,
    "recipe_7.60_bit_mixedpalette": 85.8,
    "recipe_6.94_bit_mixedpalette": 85.1,
    "recipe_6.63_bit_mixedpalette": 84.6,
    "recipe_6.30_bit_mixedpalette": 84.2,
    "recipe_5.94_bit_mixedpalette": 83.4,
    "recipe_5.72_bit_mixedpalette": 83.0
  }
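As an aside, the achieved bitrates can be read straight out of the recipe keys; a small sketch (over a subset of the baselines above) for finding the lowest bitrate a run produced:

```python
import re

# Subset of the "baselines" dict from the recipe.json above.
baselines = {
    "original": 88.4,
    "linear_8bit": 80.2,
    "recipe_8.84_bit_mixedpalette": 86.4,
    "recipe_5.72_bit_mixedpalette": 83.0,
}

# Extract (bits, psnr) pairs from keys of the form
# recipe_<bits>_bit_mixedpalette, sorted ascending by bitrate.
recipes = sorted(
    (float(m.group(1)), v)
    for k, v in baselines.items()
    if (m := re.fullmatch(r"recipe_([\d.]+)_bit_mixedpalette", k))
)
assert recipes[0] == (5.72, 83.0)  # lowest bitrate this run reached
```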

[Graph: runwayml_stable-diffusion-v1-5_psnr_vs_size]

The recipes from Apple for SD 1.5 seem to go WAY below 6 bits. This makes me believe I am still doing something wrong.

I tried again with the model "dreamshaper 8", increasing num-recipes to 11, but this time the bitrate stayed above 6 and the graph comparison looks kinda weird:

  "model_version": "Lykon/dreamshaper-8",
  "baselines": {
    "original": 59.462500000000006,
    "linear_8bit": 32.3,
    "recipe_15.52_bit_mixedpalette": 57.6,
    "recipe_13.91_bit_mixedpalette": 53.6,
    "recipe_12.76_bit_mixedpalette": 50.3,
    "recipe_11.10_bit_mixedpalette": 47.9,
    "recipe_9.98_bit_mixedpalette": 45.0,
    "recipe_8.83_bit_mixedpalette": 42.7,
    "recipe_7.49_bit_mixedpalette": 40.0,
    "recipe_7.16_bit_mixedpalette": 38.4,
    "recipe_6.94_bit_mixedpalette": 35.6,
    "recipe_6.71_bit_mixedpalette": 33.9,
    "recipe_6.37_bit_mixedpalette": 31.5
  }

[Graph: Lykon_dreamshaper-8_psnr_vs_size]

Anything I can do to generate recipes with bitrates below 6 to experiment?

MenKuch commented 11 months ago

I tried again to replicate the results in the Apple-provided recipes using mixed_bit_compression_pre_analysis for runwayml/stable-diffusion-v1-5 – sadly, without any success.

I used different GPUs on AWS (V100, A10G), different PyTorch versions (1.13, 2.0), and different values for the "TEST_RESOLUTION" constant in "mixed_bit_compression_pre_analysis.py" (512, 768), but nothing led to bitrates like those in the Apple-provided JSON recipes.

I would be very grateful if anyone could share information on the environment used to replicate the Apple results.