TheLastBen / fast-stable-diffusion

fast-stable-diffusion + DreamBooth
MIT License

SDXL Lora Paperspace Trainer Errors - Non-Zero Exit Status #2798

Closed: sixpt closed this issue 8 months ago

sixpt commented 8 months ago

Hi, I'm having an error trying to train a Lora on Paperspace. I've tried running it normally and in Jupyter. Everything works up until the training step, with default parameters or my own. I can see an error flash too quickly to read after the Training Text Encoder step, then it errors out and terminates with 'non-zero exit status 1' during Training UNet. Any idea what might be happening?

I've tried using the latest version and starting from scratch. I successfully ran it twice yesterday, then started getting these errors last night and again this morning.

    from diffusers import AutoencoderKL, PNDMScheduler, StableDiffusionXLPipeline, UNet2DConditionModel
ImportError: cannot import name 'StableDiffusionXLPipeline' from 'diffusers' (/usr/local/lib/python3.9/dist-packages/diffusers/__init__.py)

[ASCII-art "TRAINING" banner]

Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3.9', '/notebooks/diffusers/examples/dreambooth/train_dreambooth_sdxl_TI.py', '--external_captions', '--dim=64', '--ofstnselvl=0', '--image_captions_filename', '--Session_dir=/notebooks/Fast-Dreambooth/Sessions/TD_v3', '--pretrained_model_name_or_path=/notebooks/stable-diffusion-XL', '--instance_data_dir=/notebooks/Fast-Dreambooth/Sessions/TD_v3/instance_images', '--output_dir=/notebooks/models/TD_v3', '--captions_dir=/notebooks/Fast-Dreambooth/Sessions/TD_v3/captions', '--seed=127298', '--resolution=1024', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--use_8bit_adam', '--learning_rate=1e-6', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--num_train_epochs=40']' returned non-zero exit status 1.

[ASCII-art "TRAINING" banner]

Traceback (most recent call last):
  File "/notebooks/diffusers/examples/dreambooth/train_dreambooth_sdxl_lora.py", line 23, in <module>
    from diffusers import AutoencoderKL, PNDMScheduler, StableDiffusionXLPipeline, UNet2DConditionModel
ImportError: cannot import name 'StableDiffusionXLPipeline' from 'diffusers' (/usr/local/lib/python3.9/dist-packages/diffusers/__init__.py)
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3.9', '/notebooks/diffusers/examples/dreambooth/train_dreambooth_sdxl_lora.py', '--external_captions', '--saves=[30,60,90,120,150,180,210,240,270,300,330]', '--dim=64', '--ofstnselvl=0', '--image_captions_filename', '--Session_dir=/notebooks/Fast-Dreambooth/Sessions/TD_v3', '--pretrained_model_name_or_path=/notebooks/stable-diffusion-XL', '--instance_data_dir=/notebooks/Fast-Dreambooth/Sessions/TD_v3/instance_images', '--output_dir=/notebooks/models/TD_v3', '--captions_dir=/notebooks/Fast-Dreambooth/Sessions/TD_v3/captions', '--seed=127298', '--resolution=1024', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--use_8bit_adam', '--learning_rate=1e-6', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--num_train_epochs=360']' returned non-zero exit status 1.

[I 2024-04-04 13:36:58.791 ServerApp] Saving file at /SDXL-LoRA-PPS-Copy1.ipynb
75.9 GiB | 150 GiB 0% GPU 0% RAM 1.2|45 GiB paperspace/gradient-base:pt112-tf29-jax0317-py39-20230125
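
For an error that flashes past too quickly to read, one option is to run the failing command yourself and save its output to a file. The following is only a minimal sketch using Python's standard `subprocess` module; `run_and_log` and the log file name are hypothetical helpers, not part of the notebook, and the command list you pass in would be the one printed in the `CalledProcessError` message.

```python
import subprocess
import sys
from pathlib import Path


def run_and_log(cmd, log_path, tail_lines=20):
    """Run `cmd`, write its combined stdout/stderr to `log_path`,
    and return (exit_code, last `tail_lines` lines of output)."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so the traceback is captured too
        text=True,
    )
    Path(log_path).write_text(result.stdout)
    return result.returncode, result.stdout.splitlines()[-tail_lines:]


if __name__ == "__main__":
    # Stand-in for the real training command: a child process that prints and fails.
    demo_cmd = [sys.executable, "-c", "import sys; print('demo failure'); sys.exit(1)"]
    code, tail = run_and_log(demo_cmd, "train_output.log")
    print("exit code:", code)
    print("\n".join(tail))
```

The non-zero exit status reported by `accelerate` only says that the child script failed; the child's own traceback, saved this way, is what names the actual cause.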

TheLastBen commented 8 months ago

Make sure you run the first cell; it appears the right version of diffusers isn't installed.
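
A quick way to test this from a notebook cell is to ask Python which diffusers it sees and whether that copy exposes the class named in the `ImportError`. This is a hedged sketch: `check_symbol` is a hypothetical helper built on the standard `importlib` modules, and it does not say which diffusers version the notebook expects.

```python
import importlib
import importlib.metadata


def check_symbol(package, symbol):
    """Report the installed version and location of `package`,
    and whether it exposes the attribute `symbol`."""
    report = {"package": package, "version": None, "location": None, "has_symbol": False}
    try:
        report["version"] = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        pass  # no pip metadata (package missing, or a stdlib module)
    try:
        module = importlib.import_module(package)
    except ImportError:
        return report  # the package cannot be imported at all
    report["location"] = getattr(module, "__file__", None)
    try:
        getattr(module, symbol)
        report["has_symbol"] = True
    except Exception:
        # Missing attribute, or a lazy import that fails when accessed
        report["has_symbol"] = False
    return report


if __name__ == "__main__":
    print(check_symbol("diffusers", "StableDiffusionXLPipeline"))
```

If `has_symbol` comes back `False`, the `location` field shows which install is being picked up, which can be compared with the `/usr/local/lib/python3.9/dist-packages/diffusers` path in the traceback.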

sixpt commented 8 months ago

Thanks for the tip. I do run the cell every time, I've tried setting it to both 'True' and 'False.' Is there a particular directory I should try wiping to start from scratch?

sixpt commented 8 months ago

Never mind. I had tried starting a new machine a couple of times and kept getting the same error, but after walking away for a couple of hours and trying again on a new machine, everything appears to be working. Thanks again for your attention to it!

sixpt commented 8 months ago

I lied. I got all the way to training the UNet and saw progress, but then got another error upon completion. I gave up and started a new notebook altogether, and it appears to be working. Not sure what I did to my old one to break it.