bmaltais / kohya_ss

accelerate command missing ! on Kaggle #2271

Closed · FurkanGozukara closed this 6 months ago

FurkanGozukara commented 6 months ago

The code was working previously, but since you made some system changes @bmaltais it is not working anymore.

The issue should be easy to fix.

When I click Start training, it tries to run the command like this:

[screenshot]

But it can't find the command, as shown below:

[screenshot]

When I copy and paste it into a new cell and run it as below, it works perfectly fine. Can you fix this urgently if possible?

[screenshot]

bmaltais commented 6 months ago

It is working fine on my local machine... I had to remove the `shell=True` because it was reported as a security vulnerability... but perhaps this is causing the run issue?
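As background, here is a minimal illustration (not the repository's actual code, and the hostile value is invented) of why a command string assembled from GUI inputs and run with `shell=True` is typically flagged as a command-injection risk:

```python
# Illustration only: why handing a GUI-built string to the shell is risky.
import shlex

output_name = 'model"; rm -rf ~; echo "'   # hostile value typed into a GUI field
run_cmd = f'accelerate launch train.py --output_name="{output_name}"'

# With shell=True this string goes to /bin/sh, which would execute the
# injected `rm -rf ~` as its own command:
print(run_cmd)

# With an argv list (no shell), the hostile text stays a single, inert argument:
argv = ["accelerate", "launch", "train.py", f"--output_name={output_name}"]
print(shlex.join(argv))
```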

FurkanGozukara commented 6 months ago

> It is working fine on my local machine... I had to remove the `shell=True` because it was reported as a security vulnerability... but perhaps this is causing the run issue?

Very probably. How can we enable it? This is on Kaggle.

bmaltais commented 6 months ago

This is the thing: there is an official CVE advisory for the GUI and I had to fix it... but how can this be fixed without causing this issue with Kaggle... perhaps I can add another flag to set `shell=True` if one is passing a `--run-insecure-shell` flag to the GUI?
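A rough sketch of how such an opt-in flag could be wired (the argument parsing and the executor function here are assumptions for illustration, not the project's actual implementation):

```python
# Hypothetical wiring of a --run-insecure-shell opt-in (names are assumptions).
import argparse
import shlex
import subprocess

parser = argparse.ArgumentParser()
parser.add_argument(
    "--run-insecure-shell",
    action="store_true",
    help="Run trainer commands through the shell (less secure; needed on some hosts).",
)
args = parser.parse_args()


def execute_command(run_cmd: str, use_shell: bool = args.run_insecure_shell, **kwargs):
    if use_shell:
        # Opt-in path: hand the whole command string to the shell.
        return subprocess.Popen(run_cmd, shell=True, **kwargs)
    # Default path: split into an argv list so no shell is involved.
    return subprocess.Popen(shlex.split(run_cmd), **kwargs)
```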

FurkanGozukara commented 6 months ago

`--run-insecure-shell`

would totally work

bmaltais commented 6 months ago

I have created a branch to build the feature... I will post back when it can be tested... What do you train at the moment? Dreambooth? LoRA? I will implement it for that first.

bmaltais commented 6 months ago

OK, give it a test:

git fetch origin
git checkout 2271-accelerate-command-missing-on-kaggle

FurkanGozukara commented 6 months ago

> OK, give it a test:
>
> git fetch origin
> git checkout 2271-accelerate-command-missing-on-kaggle

Sadly it didn't fix it.

It shows ' at the beginning and ' at the end.

Could that be the reason?

Already on '2271-accelerate-command-missing-on-kaggle'
Your branch is up to date with 'origin/2271-accelerate-command-missing-on-kaggle'.
venv folder does not exist. Not activating...
02:11:35-118993 INFO     Kohya_ss GUI version: v23.1.6                          
02:11:35-220138 INFO     Submodule initialized and updated.                     
02:11:35-222165 INFO     nVidia toolkit detected                                
02:11:37-666674 INFO     Torch 2.1.2+cu118                                      
02:11:37-721434 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700             
02:11:37-749660 INFO     Torch detected GPU: Tesla T4 VRAM 15102 Arch (7, 5)    
                         Cores 40                                               
02:11:37-751953 INFO     Torch detected GPU: Tesla T4 VRAM 15102 Arch (7, 5)    
                         Cores 40                                               
02:11:37-757186 INFO     Python version is 3.10.13 | packaged by conda-forge |  
                         (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]             
02:11:37-759623 INFO     Verifying modules installation status from             
                         /kaggle/working/kohya_ss/requirements_linux.txt...     
02:11:37-763891 INFO     Verifying modules installation status from             
                         requirements.txt...                                    
02:11:48-882172 INFO     headless: False                                        
Running on local URL:  http://127.0.0.1:7860/

To create a public link, set `share=True` in `launch()`.
02:12:31-217566 INFO     Loading config...                                      
/opt/conda/lib/python3.10/site-packages/gradio/components/dropdown.py:173: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include:  or set allow_custom_value=True.
  warnings.warn(
02:12:31-730220 INFO     SDXL model selected. Setting sdxl parameters           
02:12:40-885073 INFO     Start training Dreambooth...                           
02:12:40-886942 INFO     Validating model file or folder path                   
                         stabilityai/stable-diffusion-xl-base-1.0 existence...  
02:12:40-889022 INFO     ...valid                                               
02:12:40-890312 INFO     Validating output_dir path                             
                         /kaggle/working/results/model existence...             
02:12:40-891943 INFO     ...valid                                               
02:12:40-893196 INFO     Validating train_data_dir path                         
                         /kaggle/working/results/img existence...               
02:12:40-895042 INFO     ...valid                                               
02:12:40-896311 INFO     Validating reg_data_dir path                           
                         /kaggle/working/results/reg existence...               
02:12:40-897943 INFO     ...valid                                               
02:12:40-899179 INFO     Validating logging_dir path /kaggle/working/results/log
                         existence...                                           
02:12:40-900749 INFO     ...valid                                               
02:12:40-901921 INFO     log_tracker_config not specified, skipping validation  
02:12:40-903353 INFO     resume not specified, skipping validation              
02:12:40-904707 INFO     Checking vae... huggingface.co model, skipping         
                         validation                                             
02:12:40-906307 INFO     dataset_config not specified, skipping validation      
02:12:40-907776 INFO     Folder 100_ohwx man : steps 1500                       
02:12:40-909186 INFO     Regularisation images are used... Will double the      
                         number of steps required...                            
02:12:40-910894 INFO     max_train_steps (1500 / 1 / 1 * 1 * 2) = 3000          
02:12:40-912591 INFO     stop_text_encoder_training = 0                         
02:12:40-913974 INFO     lr_warmup_steps = 300                                  
02:12:40-915436 INFO     Can't use LR warmup with LR Scheduler constant...      
                         ignoring...                                            
02:12:40-917664 WARNING  Here is the trainer command as a reference. It will not
                         be executed:                                           

accelerate launch --mixed_precision="fp16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=4 "/kaggle/working/kohya_ss/sd-scripts/sdxl_train.py" --max_grad_norm=0.0 --no_half_vae --train_text_encoder --ddp_timeout=10000000 --ddp_gradient_as_bucket_view --learning_rate_te2="0"  --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --full_fp16 --gradient_checkpointing --huber_c="0.1" --huber_schedule="snr" --learning_rate="1e-05" --learning_rate_te1="3e-06" --learning_rate_te2="0" --logging_dir="/kaggle/working/results/log" --loss_type="l2" --lr_scheduler="constant" --lr_scheduler_num_cycles="1" --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3000" --mem_eff_attn --min_timestep=0 --mixed_precision="fp16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01 --optimizer_type="Adafactor" --output_dir="/kaggle/working/results/model" --output_name="My_DB_Kaggle" --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --reg_data_dir="/kaggle/working/results/reg" --save_every_n_epochs="1" --save_every_n_steps="1300" --save_model_as=safetensors --save_precision="fp16" --train_batch_size="1" --train_data_dir="/kaggle/working/results/img" --xformers
02:12:48-732112 INFO     Start training Dreambooth...                           
02:12:48-733761 INFO     Validating model file or folder path                   
                         stabilityai/stable-diffusion-xl-base-1.0 existence...  
02:12:48-735586 INFO     ...valid                                               
02:12:48-736905 INFO     Validating output_dir path                             
                         /kaggle/working/results/model existence...             
02:12:48-738537 INFO     ...valid                                               
02:12:48-740062 INFO     Validating train_data_dir path                         
                         /kaggle/working/results/img existence...               
02:12:48-741963 INFO     ...valid                                               
02:12:48-743220 INFO     Validating reg_data_dir path                           
                         /kaggle/working/results/reg existence...               
02:12:48-744655 INFO     ...valid                                               
02:12:48-746097 INFO     Validating logging_dir path /kaggle/working/results/log
                         existence...                                           
02:12:48-747757 INFO     ...valid                                               
02:12:48-748985 INFO     log_tracker_config not specified, skipping validation  
02:12:48-750435 INFO     resume not specified, skipping validation              
02:12:48-751729 INFO     Checking vae... huggingface.co model, skipping         
                         validation                                             
02:12:48-753137 INFO     dataset_config not specified, skipping validation      
02:12:48-754655 INFO     Folder 100_ohwx man : steps 1500                       
02:12:48-756047 INFO     Regularisation images are used... Will double the      
                         number of steps required...                            
02:12:48-757669 INFO     max_train_steps (1500 / 1 / 1 * 1 * 2) = 3000          
02:12:48-759201 INFO     stop_text_encoder_training = 0                         
02:12:48-760531 INFO     lr_warmup_steps = 300                                  
02:12:48-761826 INFO     Can't use LR warmup with LR Scheduler constant...      
                         ignoring...                                            
02:12:48-763608 INFO     Saving training config to                              
                         /kaggle/working/results/model/My_DB_Kaggle_20240413-021
                         248.json...                                            
02:12:48-765815 INFO     accelerate launch --mixed_precision="fp16"             
                         --num_processes=1 --num_machines=1                     
                         --num_cpu_threads_per_process=4                        
                         "/kaggle/working/kohya_ss/sd-scripts/sdxl_train.py"    
                         --max_grad_norm=0.0 --no_half_vae --train_text_encoder 
                         --ddp_timeout=10000000 --ddp_gradient_as_bucket_view   
                         --learning_rate_te2="0"  --bucket_no_upscale           
                         --bucket_reso_steps=64 --cache_latents                 
                         --cache_latents_to_disk --full_fp16                    
                         --gradient_checkpointing --huber_c="0.1"               
                         --huber_schedule="snr" --learning_rate="1e-05"         
                         --learning_rate_te1="3e-06" --learning_rate_te2="0"    
                         --logging_dir="/kaggle/working/results/log"            
                         --loss_type="l2" --lr_scheduler="constant"             
                         --lr_scheduler_num_cycles="1"                          
                         --max_data_loader_n_workers="0"                        
                         --resolution="1024,1024" --max_train_steps="3000"      
                         --mem_eff_attn --min_timestep=0                        
                         --mixed_precision="fp16" --optimizer_args              
                         scale_parameter=False relative_step=False              
                         warmup_init=False weight_decay=0.01                    
                         --optimizer_type="Adafactor"                           
                         --output_dir="/kaggle/working/results/model"           
                         --output_name="My_DB_Kaggle"                           
                         --pretrained_model_name_or_path="stabilityai/stable-dif
                         fusion-xl-base-1.0"                                    
                         --reg_data_dir="/kaggle/working/results/reg"           
                         --save_every_n_epochs="1" --save_every_n_steps="1300"  
                         --save_model_as=safetensors --save_precision="fp16"    
                         --train_batch_size="1"                                 
                         --train_data_dir="/kaggle/working/results/img"         
                         --xformers                                             
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/gradio/queueing.py", line 495, in call_prediction
    output = await route_utils.call_process_api(
  File "/opt/conda/lib/python3.10/site-packages/gradio/route_utils.py", line 235, in call_process_api
    output = await app.get_blocks().process_api(
  File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 1627, in process_api
    result = await self.call_function(
  File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 1173, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/opt/conda/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
    return await future
  File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/opt/conda/lib/python3.10/site-packages/gradio/utils.py", line 690, in wrapper
    response = f(*args, **kwargs)
  File "/kaggle/working/kohya_ss/kohya_gui/dreambooth_gui.py", line 731, in train_model
    executor.execute_command(run_cmd=run_cmd, env=env)
  File "/kaggle/working/kohya_ss/kohya_gui/class_command_executor.py", line 32, in execute_command
    self.process = subprocess.Popen(run_cmd, **kwargs)
  File "/opt/conda/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/conda/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'accelerate launch --mixed_precision="fp16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=4 "/kaggle/working/kohya_ss/sd-scripts/sdxl_train.py" --max_grad_norm=0.0 --no_half_vae --train_text_encoder --ddp_timeout=10000000 --ddp_gradient_as_bucket_view --learning_rate_te2="0"  --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --full_fp16 --gradient_checkpointing --huber_c="0.1" --huber_schedule="snr" --learning_rate="1e-05" --learning_rate_te1="3e-06" --learning_rate_te2="0" --logging_dir="/kaggle/working/results/log" --loss_type="l2" --lr_scheduler="constant" --lr_scheduler_num_cycles="1" --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3000" --mem_eff_attn --min_timestep=0 --mixed_precision="fp16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01 --optimizer_type="Adafactor" --output_dir="/kaggle/working/results/model" --output_name="My_DB_Kaggle" --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --reg_data_dir="/kaggle/working/results/reg" --save_every_n_epochs="1" --save_every_n_steps="1300" --save_model_as=safetensors --save_precision="fp16" --train_batch_size="1" --train_data_dir="/kaggle/working/results/img" --xformers'
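This traceback is the expected behaviour of `subprocess.Popen` when it receives the whole command line as one string without `shell=True`: the entire string is treated as the name of a single executable, and no program literally named `accelerate launch --mixed_precision=...` exists. A minimal reproduction of the failure mode (using `echo` as a stand-in so nothing is actually trained):

```python
# Minimal reproduction of the FileNotFoundError above (echo stands in for accelerate).
import subprocess

run_cmd = 'echo launch --mixed_precision="fp16"'

try:
    # Without shell=True, the whole string is taken as one executable name.
    subprocess.Popen(run_cmd)
except FileNotFoundError as e:
    print("fails as in the log:", e)

# Handing the string to a shell instead lets the shell do the word splitting.
subprocess.Popen(run_cmd, shell=True).wait()
```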
FurkanGozukara commented 6 months ago

@bmaltais

On Kaggle: fail vs. work

[screenshot]

FurkanGozukara commented 6 months ago

Maybe I need to add headless. Gonna try and let you know.

Sorry for this.

bmaltais commented 6 months ago

Strange... so perhaps not related to the update then... Did something change on Kaggle?

scMudboy commented 6 months ago

I have this same issue on Ubuntu 22.04 training a LoRA after the update.

bmaltais commented 6 months ago

> I have this same issue on Ubuntu 22.04 training a LoRA.

On a local Ubuntu or on a cloud service?

scMudboy commented 6 months ago

Local

bmaltais commented 6 months ago

Really weird… and does running accelerate launch in the venv work?

scMudboy commented 6 months ago

Running accelerate launch in the venv does work, showing me the options.

bmaltais commented 6 months ago

So this must be the `shell=True` that has been removed... you could try to add it manually to `class_command_executor` to see if this fixes the issue...

On line 32:

self.process = subprocess.Popen(run_cmd, shell=True, **kwargs)

scMudboy commented 6 months ago

> So this must be the `shell=True` that has been removed... you could try to add it manually to `class_command_executor` to see if this fixes the issue...
>
> On line 32:
>
> self.process = subprocess.Popen(run_cmd, shell=True, **kwargs)

This did correct things for me. Thanks.

sanghyeonback commented 6 months ago

> So this must be the `shell=True` that has been removed... you could try to add it manually to `class_command_executor` to see if this fixes the issue...
>
> On line 32:
>
> self.process = subprocess.Popen(run_cmd, shell=True, **kwargs)

I had the same problem on RunPod, but this solved it.

bmaltais commented 6 months ago

Ok, so at least now we know the root cause… but I can't revert to how things were before because of the security risk it poses according to the security report I received… so I will need to find a different solution than just reinstating the `shell=True` parameter…
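One common shell-free pattern (sketched here under the assumption that the command is still assembled as a single string upstream; the actual dev-branch rewrite may build an argument list directly) is to split the string into argv before calling `Popen`:

```python
# Sketch of a shell-free launch: split the command string into an argv list
# so neither a shell nor shell=True is needed. The command string below is
# just an example.
import shlex
import shutil
import subprocess

run_cmd = 'accelerate launch --num_processes=1 "/kaggle/working/kohya_ss/sd-scripts/sdxl_train.py"'

argv = shlex.split(run_cmd)           # ['accelerate', 'launch', ...]
if shutil.which(argv[0]) is None:     # fail early with a clear message
    raise FileNotFoundError(f"{argv[0]} was not found on PATH")

process = subprocess.Popen(argv)
process.wait()
```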

FurkanGozukara commented 6 months ago

> Ok, so at least now we know the root cause… but I can't revert to how things were before because of the security risk it poses according to the security report I received… so I will need to find a different solution than just reinstating the `shell=True` parameter…

I am glad you found the error. When can we expect a solution? I am fine with insecure access :D

bmaltais commented 6 months ago

Things should work in the dev branch now. It was a major code rewrite, but for the best; it was long overdue.

FurkanGozukara commented 6 months ago

@bmaltais it works. When can you merge it in?

Also, when doing multi-GPU training it still makes 2 epochs and saves 2 checkpoints even though I train 1 epoch.

[screenshot]

bmaltais commented 6 months ago

For the multi-GPU issue, this is really something that should be sorted out with kohya.

I will need to do a few more things and bug fixes on dev while users report issues with it. I would rather we take some time to test it properly before making another release with multiple issues.

I need to see if I can fix the Start/Stop training button issue where it will not go back to Start once the training is completed… it is annoying… If I can't fix it I will probably go back to two separate buttons.