AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Multi-GPU Dreambooth Training based on Accelerator does not work! #7843

Closed Dinxin closed 1 year ago

Dinxin commented 1 year ago

### Is there an existing issue for this?

### What happened?

I used the multi-GPU training feature provided by the `accelerate` library to reduce the training time of Dreambooth.

This is the content of `conf/default_config.yaml`:

```yaml
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 2,3
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```

I want to use two of the four available GPUs (`gpu_ids: 2,3`) for the training.
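As a quick sanity check on the config above, a small stand-alone helper (not part of webui or accelerate, just an illustrative sketch) can verify that `num_processes` matches the number of GPUs listed in `gpu_ids`, since a mismatch is a common source of broken multi-GPU launches:

```python
# Hypothetical helper: parse a flat accelerate config and check that
# num_processes equals the number of comma-separated gpu_ids.
def check_gpu_config(config_text: str) -> bool:
    values = {}
    for line in config_text.splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            values[key.strip()] = val.strip()
    gpu_ids = [g for g in values.get("gpu_ids", "").split(",") if g]
    return int(values.get("num_processes", 0)) == len(gpu_ids)

config = """\
distributed_type: MULTI_GPU
gpu_ids: 2,3
num_processes: 2
"""
print(check_gpu_config(config))  # True: 2 processes, 2 GPU ids
```

For the config in this report the check passes, so the problem is not a simple process/GPU count mismatch.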

Unfortunately, I ran into the following two problems:

[Two WeChat Work screenshots of the errors were attached.]

The startup script, `webui-user-cuda1.sh`, contains:

```shell
accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
```

### Steps to reproduce the problem

  1. Write the following content into `conf/default_config.yaml`:

     ```yaml
     command_file: null
     commands: null
     compute_environment: LOCAL_MACHINE
     deepspeed_config: {}
     distributed_type: MULTI_GPU
     downcast_bf16: 'no'
     dynamo_backend: 'NO'
     fsdp_config: {}
     gpu_ids: 2,3
     machine_rank: 0
     main_process_ip: null
     main_process_port: null
     main_training_function: main
     megatron_lm_config: {}
     mixed_precision: 'no'
     num_machines: 1
     num_processes: 2
     rdzv_backend: static
     same_network: true
     tpu_name: null
     tpu_zone: null
     use_cpu: false
     ```

  2. Run the following command:

     ```shell
     accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
     ```
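For context, `accelerate launch` (like `torchrun`) spawns one worker per process and exports rank information through environment variables; with `gpu_ids: 2,3` the launcher restricts visibility via `CUDA_VISIBLE_DEVICES=2,3`, so local rank 0 maps to physical GPU 2 and local rank 1 to GPU 3. A minimal sketch of how a worker would identify itself (illustrative names, not webui code):

```python
# Sketch: map a worker's LOCAL_RANK to the physical GPU it will use,
# assuming the launcher exports LOCAL_RANK / WORLD_SIZE and narrows
# visibility with CUDA_VISIBLE_DEVICES.
def describe_worker(env: dict) -> str:
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    visible = [g for g in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if g]
    physical = visible[local_rank] if local_rank < len(visible) else "?"
    return f"rank {local_rank}/{world_size} on physical GPU {physical}"

# Simulated environment for the second worker of the launch above:
print(describe_worker({"LOCAL_RANK": "1", "WORLD_SIZE": "2",
                       "CUDA_VISIBLE_DEVICES": "2,3"}))
# rank 1/2 on physical GPU 3
```

The duplicated startup lines in the console log below are consistent with this: two workers are spawned, and each one launches a full copy of the web UI.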


### What should have happened?

Training should have run on both GPUs; instead, I did not see a correct multi-GPU training result.

### Commit where the problem happens

9e3584f0edd2e64d284b6aaf9580ade5dcceed9d

### What platforms do you use to access the UI ?

Linux

### What browsers do you use to access the UI ?

Google Chrome

### Command Line Arguments

```Shell
GRADIO_SERVER_PORT=8081

accelerate launch --config_file conf/default_config.yaml launch.py --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
```
### List of extensions

dreambooth

### Console logs

Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Commit hash: 7f8ab1ee8f304031b3404e25761dd0f4c7be7df8
Python 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Commit hash: 7f8ab1ee8f304031b3404e25761dd0f4c7be7df8

#######################################################################################################
Initializing Dreambooth
If submitting an issue on github, please provide the below text for debugging purposes:

Python revision: 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Dreambooth revision: 9e3584f0edd2e64d284b6aaf9580ade5dcceed9d
SD-WebUI revision: 

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.13 installed.
[+] torch version 1.12.0 installed.
[+] torchvision version 0.13.0 installed.
#######################################################################################################

Launching Web UI with arguments: --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access

#######################################################################################################
Initializing Dreambooth
If submitting an issue on github, please provide the below text for debugging purposes:

Python revision: 3.8.0 (default, Nov  6 2019, 21:49:08) 
[GCC 7.3.0]
Dreambooth revision: 9e3584f0edd2e64d284b6aaf9580ade5dcceed9d
SD-WebUI revision: 

Checking Dreambooth requirements...
[+] bitsandbytes version 0.35.0 installed.
[+] diffusers version 0.10.2 installed.
[+] transformers version 4.25.1 installed.
[+] xformers version 0.0.13 installed.
[+] torch version 1.12.0 installed.
[+] torchvision version 0.13.0 installed.
#######################################################################################################

Launching Web UI with arguments: --listen --gradio-auth fine-art:AIGConTaiji --enable-insecure-extension-access
libcudart.so.11.0: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libcudart.so.11.0: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
libcudart.so.11.0: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libcudart.so.11.0: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
Dreambooth API layer loaded
LatentDiffusion: Running in eps-prediction mode
Dreambooth API layer loaded
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loading weights [e6e8e1fc] from /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/Stable-diffusion/final-pruned.ckpt
Loading weights [e6e8e1fc] from /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/Stable-diffusion/final-pruned.ckpt
Applying cross attention optimization (Doggettx).
Model loaded.
Loaded a total of 1 textual inversion embeddings.
Embeddings: Candy Hearts
Applying cross attention optimization (Doggettx).
Model loaded.
Loaded a total of 1 textual inversion embeddings.
Embeddings: Candy Hearts
Running on local URL:  http://0.0.0.0:8081

To create a public link, set `share=True` in `launch()`.
Running on local URL:  http://0.0.0.0:8082

To create a public link, set `share=True` in `launch()`.
Loading model from checkpoint.
Loading checkpoint...
v1 model loaded.
Creating scheduler...
Converting unet...
Converting vae...
Converting text encoder...
Saving diffusers model...
 Restored system models. 
 Allocated: 2.0GB 
 Reserved: 2.0GB 

 Allocated 2.0/2.0GB 
 Reserved: 2.0/2.0GB 

Checkpoint successfully extracted to /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/dreambooth/xiaogong_girl/working
Concept 0 class dir is /cephfs/group/ieg-vc-vc-analysis/marcuschen/code/stable-diffusion-webui/models/dreambooth/xiaogong_girl/classifiers_0
Starting Dreambooth training...
 Allocated 0.0/2.0GB 
 Reserved: 0.0/2.0GB 

Initializing dreambooth training...
Patching transformers to fix kwargs errors.
/root/anaconda3/envs/novelai/lib/python3.8/site-packages/transformers/generation_utils.py:24: FutureWarning: Importing `GenerationMixin` from `src/transformers/generation_utils.py` is deprecated and will be removed in Transformers v5. Import as `from transformers import GenerationMixin` instead.
  warnings.warn(
Replace CrossAttention.forward to use default
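A note on the repeated `libcudart.so.11.0: cannot open shared object file` warnings in the log above: they indicate that the dynamic loader cannot find the CUDA 11 runtime library (typically an `LD_LIBRARY_PATH` issue), which prevents bitsandbytes from using its CUDA kernels. A hedged, stand-alone diagnostic sketch (not part of the webui):

```python
import ctypes

# Report whether the dynamic loader can open a given CUDA runtime library.
# If this returns False for libcudart.so.11.0, bitsandbytes will emit the
# warnings seen in the log above.
def cuda_runtime_available(name: str = "libcudart.so.11.0") -> bool:
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False

print(cuda_runtime_available())
```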

### Additional information

No response

SavvaI commented 1 year ago

I have exactly the same problem. Is there a plan to add support for the multi-GPU setting? Batch size seems to be extremely important for quality when fine-tuning.
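The batch-size point is worth spelling out: in data-parallel training the effective batch size scales with the number of processes (and any gradient accumulation). A tiny illustrative sketch (names are generic, not the Dreambooth extension's actual parameters):

```python
# Effective batch size under data parallelism: each of the N processes
# sees per_device samples per step, accumulated over grad_accum_steps.
def effective_batch_size(per_device: int, num_processes: int,
                         grad_accum_steps: int = 1) -> int:
    return per_device * num_processes * grad_accum_steps

# E.g. batch 1 per GPU, 2 GPUs, 4 accumulation steps:
print(effective_batch_size(1, 2, 4))  # 8
```

So two GPUs double the effective batch for the same per-device memory footprint, which is why multi-GPU support matters here.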

fernando-deka commented 1 year ago

We are also interested in the development of this feature.