camenduru / stable-diffusion-webui-colab

stable diffusion webui colab
The Unlicense

(memory?) issue with stable_diffusion_v2_1_webui_colab when mounting Google Drive #21

Closed system1system2 closed 1 year ago

system1system2 commented 1 year ago

Hi. All your colab notebooks are amazing. Thanks for sharing them with the community.

I have a problem with one of them: stable_diffusion_v2_1_webui_colab

If I create a new cell to mount my Google Drive and run it before your cell that initializes SD 2.1, the initialization stops halfway through and I get this output:

Python 3.8.16 (default, Dec 7 2022, 01:12:13) [GCC 7.5.0]
Commit hash: 4af3ca5393151d61363c30eef4965e694eeac15e
Installing gfpgan
Installing clip
Installing open_clip
Cloning Stable Diffusion into repositories/stable-diffusion-stability-ai...
Cloning Taming Transformers into repositories/taming-transformers...
Cloning K-diffusion into repositories/k-diffusion...
Cloning CodeFormer into repositories/CodeFormer...
Cloning BLIP into repositories/BLIP...
Installing requirements for CodeFormer
Installing requirements for Web UI
Launching Web UI with arguments: --share --force-enable-xformers
No module 'xformers'. Proceeding without it.
Cannot import xformers
Traceback (most recent call last):
  File "/content/stable-diffusion-webui/modules/sd_hijack_optimizations.py", line 18, in <module>
    import xformers.ops
ModuleNotFoundError: No module named 'xformers.ops'; 'xformers' is not a package

Loading config from: /content/stable-diffusion-webui/models/Stable-diffusion/v2-1_768-ema-pruned.yaml
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
Downloading: 100% 3.94G/3.94G [00:57<00:00, 69.0MB/s]
^C

I read that the ^C interrupt might indicate the system has run out of memory.

If I do not run the cell that mounts Google Drive, everything works fine.

Also, and this is the strange part: if I run another of your Colab notebooks, such as analog_diffusion_webui_colab, with the cell that mounts Google Drive run first, everything works fine too.
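
For what it's worth, the `^C` does look like the runtime being killed on memory exhaustion. A small stdlib helper (my own sketch, not part of the notebook) can be dropped into a cell to watch free system RAM between steps:

```python
def mem_available_gb(meminfo_path="/proc/meminfo"):
    """Return the kernel's MemAvailable figure in GiB, or None if not found."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                kb = int(line.split()[1])   # /proc/meminfo reports values in kB
                return kb / (1024 ** 2)
    return None

# In a Colab cell: print(f"{mem_available_gb():.2f} GiB free")
```

Running it right before the model download starts would show whether the Drive mount alone is what eats the headroom.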

camenduru commented 1 year ago

--force-enable-xformers is obsolete, please use --xformers

system1system2 commented 1 year ago

Thanks for the quick reply, @camenduru

The issue is not related to the xformers module (even though it appears in my quoted output). I had this issue even before the --force-enable-xformers flag was rendered obsolete. And now, after updating the notebook with the new --xformers flag, I still have it:

Python 3.8.16 (default, Dec 7 2022, 01:12:13) [GCC 7.5.0]
Commit hash: 4af3ca5393151d61363c30eef4965e694eeac15e
Installing gfpgan
Installing clip
Installing open_clip
Cloning Stable Diffusion into repositories/stable-diffusion-stability-ai...
Cloning Taming Transformers into repositories/taming-transformers...
Cloning K-diffusion into repositories/k-diffusion...
Cloning CodeFormer into repositories/CodeFormer...
Cloning BLIP into repositories/BLIP...
Installing requirements for CodeFormer
Installing requirements for Web UI
Launching Web UI with arguments: --share --xformers
Loading config from: /content/stable-diffusion-webui/models/Stable-diffusion/v2-1_768-ema-pruned.yaml
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
Downloading: 100% 3.94G/3.94G [00:56<00:00, 70.3MB/s]
^C

This issue exclusively happens with this particular notebook (again, I don't have it with, for example, the Analog Diffusion module), and only if I try to mount my Google Drive before running the A1111 installation cell.

camenduru commented 1 year ago

this is working https://github.com/camenduru/stable-diffusion-webui-colab/blob/main/stable_diffusion_v2_1_webui_colab.ipynb please compare your code with stable_diffusion_v2_1_webui_colab.ipynb

camenduru commented 1 year ago

wait, it is not working with gdrive, interesting 😋

system1system2 commented 1 year ago

I changed nothing in your code. All I do is:

I do exactly these steps with your colab for Analog Diffusion and it works flawlessly, as expected.

I have no clue why there's this difference in behaviour.

camenduru commented 1 year ago
stable_diffusion_v2_1_webui_colab: with gdrive (crashed) vs. without gdrive
Screenshot 2022-12-29 144347 Screenshot 2022-12-29 144950

stable_diffusion_1_5_webui_colab with gdrive Screenshot 2022-12-29 151319

stable_diffusion_v2_1 is using too much system RAM: without gdrive it works, but gdrive plus stable_diffusion_v2_1 does not fit in the system RAM 😨

camenduru commented 1 year ago

if we convert the fp32 v2-1_768-ema-pruned.ckpt (5.21 GB) to fp16, 5.21/2 ≈ 2.6 GB will probably fit
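
The arithmetic holds because fp32 stores each weight in 4 bytes and fp16 in 2. A quick stdlib illustration of the halving (the actual checkpoint conversion would presumably go through torch's `.half()` on the state dict, which is my assumption, not something stated in the thread):

```python
import struct

weights = [0.5, -1.25, 3.0]                       # stand-ins for model weights
fp32 = struct.pack(f"{len(weights)}f", *weights)  # 'f' = 4 bytes per value
fp16 = struct.pack(f"{len(weights)}e", *weights)  # 'e' = 2 bytes per value (half precision)
print(len(fp32), len(fp16))  # 12 6
```

The same 2:1 byte ratio is why the 5.21 GB checkpoint lands near 2.6 GB (the published fp16 file is 2.58 GB, with the small difference coming from non-fp32 entries that are not halved).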

MitPitt commented 1 year ago

Had the same problem and figured out a workaround: crash the Colab runtime right before launching the UI; this will free up the RAM.

Do this after downloading the models:

import os
os.kill(os.getpid(), 9)

This will crash the runtime. Now reconnect and run:

%cd /content/stable-diffusion-webui
!python launch.py --share --xformers
camenduru commented 1 year ago

thanks @MitPitt ❤ good idea 🤩

system1system2 commented 1 year ago

Thanks, @MitPitt, but I still can't make it work.

I split @camenduru's original notebook into multiple cells as in the screenshot. I executed your recommended os.kill cell. The environment crashes as expected and reconnects automatically.

Then I proceed launching A1111, but I still run out of system RAM.

What am I doing wrong?

Screenshot 2023-01-02 at 16 43 25
MitPitt commented 1 year ago

Google Drive takes up RAM as well; I had this problem. You will have to download any needed files manually, without mounting the drive. Use this command to download public files from Google Drive:

!curl -o train_images.zip -L 'https://drive.google.com/uc?export=download&confirm=yes&id=[ID]' # replace [ID]

You can find your file's ID by looking at the share link: https://drive.google.com/file/d/ABCDEFG/view?usp=share_link. Here, the ID is ABCDEFG.
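
That extraction can be scripted so the ID never has to be copied by hand. A small helper (my own sketch, not from the thread), assuming the standard `/file/d/<ID>/` share-link shape shown above:

```python
import re

def gdrive_file_id(share_link: str) -> str:
    """Extract the file ID from a Google Drive share link."""
    m = re.search(r"/file/d/([^/?]+)", share_link)
    if m is None:
        raise ValueError(f"no file ID found in {share_link!r}")
    return m.group(1)

print(gdrive_file_id("https://drive.google.com/file/d/ABCDEFG/view?usp=share_link"))  # ABCDEFG
```

The result can then be interpolated into the curl URL in place of `[ID]`.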

camenduru commented 1 year ago

hi @system1system2 👋 I converted it to fp16, it is now 2.58 GB, please use this one with gdrive

https://huggingface.co/ckpt/stable-diffusion-2-1/resolve/main/v2-1_768-ema-pruned-fp16.ckpt

system1system2 commented 1 year ago

Thank you so much for converting this. Unfortunately, I still have issues:

I have modified your colab notebook to download the correct file and save it with the old file name, so I don't have to rename the yaml file as well:

!wget https://huggingface.co/ckpt/stable-diffusion-2-1/resolve/main/v2-1_768-ema-pruned-fp16.ckpt -O /content/stable-diffusion-webui/models/Stable-diffusion/v2-1_768-ema-pruned.ckpt
!wget https://raw.githubusercontent.com/Stability-AI/stablediffusion/main/configs/stable-diffusion/v2-inference-v.yaml -O /content/stable-diffusion-webui/models/Stable-diffusion/v2-1_768-ema-pruned.yaml

It correctly downloads the half-precision variant (which is saved on the Colab disk as a 2.4G file), but then it insists on loading a 3.4GB file:

Screenshot 2023-01-03 at 10 51 23

and that's where it runs out of memory as usual:

Python 3.8.16 (default, Dec 7 2022, 01:12:13) [GCC 7.5.0]
Commit hash: 4af3ca5393151d61363c30eef4965e694eeac15e
Installing gfpgan
Installing clip
Installing open_clip
Cloning Stable Diffusion into repositories/stable-diffusion-stability-ai...
Cloning Taming Transformers into repositories/taming-transformers...
Cloning K-diffusion into repositories/k-diffusion...
Cloning CodeFormer into repositories/CodeFormer...
Cloning BLIP into repositories/BLIP...
Installing requirements for CodeFormer
Installing requirements for Web UI
Launching Web UI with arguments: --share --xformers
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
Loading config from: /content/stable-diffusion-webui/models/Stable-diffusion/v2-1_768-ema-pruned.yaml
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
Downloading: 100% 3.94G/3.94G [01:04<00:00, 61.2MB/s]
^C

Also notice that during the process, Python raises a new error:

A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'

I don't know if it's important or not as I cannot load the UI to test image generation.

camenduru commented 1 year ago

at this point, I am thinking that there may be a memory leak in the code 🤔

MisoSpree commented 1 year ago

Agree with all of the above. I tried just installing the WebUI without connecting my drive. It died the same death as described above.

Launching Web UI with arguments: --share --xformers
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
Loading config from: /content/stable-diffusion-webui/models/Stable-diffusion/768-v-ema.yaml
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
Downloading: 100% 3.94G/3.94G [00:56<00:00, 69.6MB/s]
^C

And I switched to this latest version because I can no longer fix the "RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)" error on the older version I used (none of the suggested edits to the ddpm file work anymore). I would dearly love to train more embeddings, but I can't seem to find a version that runs for me on Colab (with a paid account).

Edit: But I did get the WebUI running from midjourney_v4_diffusion_webui_colab.ipynb before attaching the Google Drive. (Now trying to mount the drive does nothing: no pop-up, no error message, no mount.) And the runtime error about indices is still a problem. I am really sad about this.

inu-ai commented 1 year ago

Forking and patching Stability-AI's stablediffusion repository will bring it within 12 GB. Here is a similar issue. (Translated with DeepL)

https://github.com/ddPn08/automatic1111-colab/issues/16 https://github.com/ddPn08/automatic1111-colab/commit/27484525d2aaf98b8ba75ce955dd553dc2eb3ab3

camenduru commented 1 year ago

Thank you, @thx-pw. ❤ ❤

camenduru commented 1 year ago

!sed -i -e '''/prepare_environment()/a\ os.system\(f\"""sed -i -e ''\"s/dict()))/dict())).cuda()/g\"'' /content/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/util.py""")''' /content/stable-diffusion-webui/launch.py

camenduru commented 1 year ago

@thx-pw, this one is also working: return get_obj_from_str(config["target"])(**config.get("params", dict())).cuda()
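
As I read the nested-sed one-liner above (my own interpretation, not stated in the thread), the outer sed edits launch.py so that, right after `prepare_environment()` runs, it rewrites `ldm/util.py` to instantiate the model with `.cuda()`, sending the weights straight to GPU VRAM instead of holding them in system RAM. The inner substitution amounts to:

```python
def patch_util_py(source: str) -> str:
    """Apply the same edit as the inner sed: s/dict()))/dict())).cuda()/g"""
    return source.replace("dict()))", "dict())).cuda()")

line = 'return get_obj_from_str(config["target"])(**config.get("params", dict()))'
print(patch_util_py(line))
```

which produces exactly the patched line quoted in the comment above.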

camenduru commented 1 year ago

Please check it.

camenduru commented 1 year ago

@system1system2 please try this https://github.com/camenduru/stable-diffusion-webui-colab/blob/main/stable_diffusion_v2_1_webui_colab.ipynb

inu-ai commented 1 year ago

You can do it in one line. That's smarter.

MisoSpree commented 1 year ago

@system1system2 please try this https://github.com/camenduru/stable-diffusion-webui-colab/blob/main/stable_diffusion_v2_1_webui_colab.ipynb

I am running it now. So far, sed is throwing "No such file or directory" errors (for both sed calls). Edit: but apparently it doesn't matter? I couldn't mount my Google Drive, but I just uploaded my training images and am now training an embedding.

camenduru commented 1 year ago

hi @MisoSpree 👋 sed is working; we get this message because we use sed inside sed before the file has been fetched from the repo, a little trick hehe

sed: can't read /content/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/util.py: No such file or directory
sed: can't read /content/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py: No such file or directory
MisoSpree commented 1 year ago

Roger that. Ignoring error messages is right up my alley.

MisoSpree commented 1 year ago

Note that when training an embedding, the loss is reported as a NaN:

[Epoch 499: 10/10]loss: nan: 10% 4999/50000 [45:34<6:49:13, 1.83it/s]

And the image put out every N steps is just black. Looks like something is broken still.
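
A small guard (my own suggestion, not something in the notebook) would at least stop a run early instead of burning hours on NaN losses:

```python
import math

def check_loss(loss: float, step: int) -> None:
    """Abort training on a non-finite loss instead of silently producing black outputs."""
    if not math.isfinite(loss):
        raise RuntimeError(f"non-finite loss at step {step}; aborting training")

check_loss(0.12, 1)  # a healthy loss passes silently
```

Hooked into the training loop, this would have flagged the problem at step 1 rather than 4999.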

camenduru commented 1 year ago

oh no 😐

system1system2 commented 1 year ago

@camenduru believe it or not, it works (at least for ordinary txt2img generations; I didn't try to train an embedding like @MisoSpree). The weird sed trick worked, but you might want to mention it in the documentation or you'll have an avalanche of people reporting the same No such file or directory error that @MisoSpree reported.

Thanks for the patience in fixing this. I'm training without any issues this morning thanks to you.

ddPn08 commented 1 year ago

In my environment, I had no problem training the embedding.

camenduru commented 1 year ago

hi @ddPn08 can you train without black example output? please show us how

camenduru commented 1 year ago

I tried, and I also got a black output 😭

ddPn08 commented 1 year ago

I created embedding from the train tab of AUTOMATIC1111 and trained without changing any settings. I tested it on my notebook, so I'll try it on this one too.

camenduru commented 1 year ago

did you use this colab https://github.com/camenduru/stable-diffusion-webui-colab/blob/main/stable_diffusion_v2_1_webui_colab.ipynb

MisoSpree commented 1 year ago

I tried, and I also got a black output 😭

I am glad I am not the only one. Were you seeing loss reported as NaN?

Edit: Just to check, I did this again today, in Colab. This time I first connected my Google Drive. (This is different from last time, when I didn't connect Google Drive at all.) Then I ran https://github.com/camenduru/stable-diffusion-webui-colab/blob/main/stable_diffusion_v2_1_webui_colab.ipynb. Everything installed. I generated a single text-to-image (which I always do as a test when I get the WebUI open), and that worked fine. Then I created an embedding and ran the training. Loss is still being reported as NaN and the first output image was all black, so I stopped the training.

inu-ai commented 1 year ago

Even after removing the low-RAM patch, I still get the NaN error during training. So I tried everything, and when I changed from SD2.1 to WD1.4e1, the NaN error was gone. https://huggingface.co/hakurei/waifu-diffusion-v1-4/tree/main I don't know why.

system1system2 commented 1 year ago

Just a quick note to let you know, @camenduru, that this new version of the notebook runs out of memory again :)

The problem is the single sed command, in place of the previous two.

If you replace it with the previous two lines below, the notebook works just fine, including triton installation and the new CivitAI extension:

!sed -i -e '''/prepare_environment()/a\ os.system(f\"""sed -i -e ''\"s/self.logvar\[t\]/self.logvar\[t.item()\]/g\"'' /content/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/models/diffusion/ddpm.py""")''' /content/stable-diffusion-webui/launch.py
!sed -i -e '''/prepare_environment()/a\ os.system(f\"""sed -i -e ''\"s/dict()))/dict())).cuda()/g\"'' /content/stable-diffusion-webui/repositories/stable-diffusion-stability-ai/ldm/util.py""")''' /content/stable-diffusion-webui/launch.py

camenduru commented 1 year ago

I tested this one and it worked with gdrive; I didn't change anything. Maybe you are getting less RAM, I got 12.68GB. https://github.com/camenduru/stable-diffusion-webui-colab/blob/main/stable_diffusion_v2_1_webui_colab.ipynb Screenshot 2023-01-13 162026

system1system2 commented 1 year ago

Same amount. Not sure why it works with the two sed lines but fails with the single one. At this point, it's up to you: we can close this issue as is (it works for me, at least with this specific configuration) or leave it open.