bmaltais / kohya_ss

Apache License 2.0

runpod error #2278

Closed nyagano closed 7 months ago

nyagano commented 7 months ago

Screenshot 2024-04-13 200516

mi8m commented 7 months ago

It also happens to me. https://pastebin.com/j9JzV9a2

nyagano commented 7 months ago

I wasted a ticket because of this error.

mi8m commented 7 months ago

on line 32:

self.process = subprocess.Popen(run_cmd, shell=True, **kwargs)

According to another issue, restoring shell=True like this seems to fix it. I haven't tried it, though.

mi8m commented 7 months ago

Oops, I forgot to mention: it's in class_command_executor.

bmaltais commented 7 months ago

Sorry about that. I had to remove shell=True because it is too insecure and can lead to exploits, but removing it causes issues on Linux, while Windows is fine, so I need to find a solution. I think I have found one, but I will need your help to test it. I am currently implementing a "hack" in a new branch until I can comb over the code and re-implement things in a way that is better and more secure. I put the security of users first, so you may need to use the previous "unsecure" release for a bit. I hope to have a quick fix out today.

I also learned that the Gradio release I had to upgrade to, also because of security issues with the previous Gradio release, is causing an issue on RunPod. I will bake the fix for that into the next release as well.
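
For readers wondering why dropping shell=True breaks Linux but not Windows, here is a minimal sketch of the behaviour. It assumes a POSIX system for the failing case, and the command is a harmless stand-in for the real "accelerate launch ..." string. The helper name try_launch is made up for this illustration. Without a shell, POSIX treats a whole command string as the name of one executable, whereas on Windows the string is handed to CreateProcess, which is probably why Windows was unaffected.

```python
import os
import shlex
import subprocess
import sys

# Stand-in for the GUI's long command; the child only prints a number.
command_list = [sys.executable, "-c", "print(42)"]
# The same command flattened into one string, as the GUI built it.
command_string = shlex.join(command_list)


def try_launch(cmd):
    """Run cmd WITHOUT a shell and return its output, or the error name."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # On POSIX the whole string is looked up as a single program name.
        return "FileNotFoundError"
    return done.stdout.strip()


print(try_launch(command_list))    # the list form runs the child
print(try_launch(command_string))  # on POSIX this reports FileNotFoundError
```

Splitting the string with shlex.split, or building the list directly, gives subprocess one argument per item, so no shell is needed to parse it.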

bmaltais commented 7 months ago

I have pushed a quick PoC of the fix to the fix_accelerate_issue branch. It only covers Dreambooth at the moment, but I would appreciate it if you could confirm that it works on Linux:

git pull
git checkout fix_accelerate_issue

bmaltais commented 7 months ago

OK, I have added the fix for the other trainers, but it is NOT perfect: it will cause issues for inputs like training comments. The shlex validator mangles the parameter and causes the command to fail to execute, so I think I will need to invest a lot more time to fix this. This is turning into a nightmare.
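
One plausible way free-text input gets mangled is sketched below. This is an illustration, not the project's code: the script name train.py and the comment text are invented, and the --training_comment flag is used only as an example of a free-text value. When the value is pasted into a command string with hand-written double quotes, shlex.split treats the user's own quotes as shell quoting and strips them. Quoting the value with shlex.quote first, or skipping the string and building a list, keeps it intact.

```python
import shlex

# Illustrative free-text value containing its own double quotes.
comment = 'say "hi" now'

# Naive: wrap the value in double quotes by hand, then split.
naive_string = f'train.py --training_comment="{comment}"'
naive_args = shlex.split(naive_string)

# Safer: let shlex.quote produce one properly escaped shell word.
quoted_string = f"train.py --training_comment={shlex.quote(comment)}"
quoted_args = shlex.split(quoted_string)

# Simplest: never build a string at all.
list_args = ["train.py", f"--training_comment={comment}"]

print(naive_args)   # the inner double quotes are gone
print(quoted_args)  # the comment survives unchanged
print(list_args)
```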

mi8m commented 7 months ago

Don't worry, it happens. I'm really sorry, but I can't test it right now as I'm low on RunPod credits. Thanks for the effort in trying to fix it.

nyagano commented 7 months ago

Screenshot 2024-04-13 214939

Thanks for everything. Does this mean it will not work for LoRA?

bmaltais commented 7 months ago

It should work for LoRA, at least in theory. Is this using the latest pull of the fix_accelerate_issue branch?

Looks like the error is different from the one on the original master branch, so perhaps something else is causing it?

nyagano commented 7 months ago

Yes. I got this after running these commands:

git pull
git checkout fix_accelerate_issue

bmaltais commented 7 months ago

Can you paste the command shown in the INFO log and the full traceback that follows it, as text?

nyagano commented 7 months ago
14:18:23-790314 INFO     Start training LoRA LyCORIS/LoCon ...                  
14:18:23-792530 INFO     Validating model file or folder path                   
                         /workspace/data/model/ponyDiffusionV6XL_v6StartWithThis
                         One.safetensors existence...                           
14:18:23-794914 INFO     ...valid                                               
14:18:23-796619 INFO     Validating output_dir path /workspace/data/output      
                         existence...                                           
14:18:23-798669 INFO     ...valid                                               
14:18:23-800341 INFO     Validating train_data_dir path /workspace/data/image   
                         existence...                                           
14:18:23-802392 INFO     ...valid                                               
14:18:23-803730 INFO     reg_data_dir not specified, skipping validation        
14:18:23-804475 INFO     Validating logging_dir path /workspace/data/loggine    
                         existence...                                           
14:18:23-805325 INFO     ...valid                                               
14:18:23-806030 INFO     log_tracker_config not specified, skipping validation  
14:18:23-806798 INFO     resume not specified, skipping validation              
14:18:23-807533 INFO     vae not specified, skipping validation                 
14:18:23-808257 INFO     lora_network_weights not specified, skipping validation
14:18:23-809005 INFO     dataset_config not specified, skipping validation      
14:18:23-809751 INFO     Headless mode, skipping verification if model already  
                         exist... if model already exist it will be             
                         overwritten...                                         
14:18:23-810921 INFO     Folder 7_outfit122: 122 images found                   
14:18:23-811698 INFO     Folder 7_outfit122: 854 steps                          
14:18:23-812462 INFO     Error: '.ipynb_checkpoints' does not contain an        
                         underscore, skipping...                                
14:18:23-813330 INFO     Total steps: 854                                       
14:18:23-814058 INFO     Train batch size: 3                                    
14:18:23-814814 INFO     Gradient accumulation steps: 1                         
14:18:23-815565 INFO     Epoch: 10                                              
14:18:23-816274 INFO     Regulatization factor: 1                               
14:18:23-817016 INFO     max_train_steps (854 / 3 / 1 * 10 * 1) = 2847          
14:18:23-817938 INFO     stop_text_encoder_training = 0                         
14:18:23-818714 INFO     lr_warmup_steps = 142                                  
14:18:23-819571 INFO     Saving training config to                              
                         /workspace/data/output/test_20240413-141823.json...    
14:18:23-820754 INFO     accelerate launch --mixed_precision="bf16"             
                         --num_processes=1 --num_machines=1                     
                         --num_cpu_threads_per_process=16                       
                         "/workspace/kohya_ss/sd-scripts/sdxl_train_network.py" 
                         --bucket_no_upscale --bucket_reso_steps=64             
                         --cache_latents --cache_latents_to_disk                
                         --caption_extension=".txt" --clip_skip=2               
                         --debiased_estimation_loss --enable_bucket             
                         --min_bucket_reso=320 --max_bucket_reso=4096           
                         --gradient_checkpointing --huber_c="0.1"               
                         --huber_schedule="snr" --keep_tokens="1"               
                         --learning_rate="1.0"                                  
                         --logging_dir="/workspace/data/loggine"                
                         --loss_type="l2" --lr_scheduler="cosine_with_restarts" 
                         --lr_scheduler_num_cycles="3" --lr_warmup_steps="142"  
                         --max_data_loader_n_workers="0" --max_grad_norm="1"    
                         --resolution="1024,1024" --max_token_length=225        
                         --max_train_steps="2847" --min_timestep=0              
                         --mixed_precision="bf16" --network_alpha="32"          
                         --network_args preset="full" conv_dim="64"             
                         conv_alpha="32" rank_dropout="0" bypass_mode="False"   
                         dora_wd="False" module_dropout="0" use_tucker="False"  
                         use_scalar="False" rank_dropout_scale="False"          
                         algo="locon" train_norm="False" --network_dim=64       
                         --network_module=lycoris.kohya --no_half_vae           
                         --optimizer_args decouple=True weight_decay=0.01       
                         betas=[0.9,0.99] d_coef=2 use_bias_correction=True     
                         safeguard_warmup=True --optimizer_type="Prodigy"       
                         --output_dir="/workspace/data/output"                  
                         --output_name="test"                                   
                         --pretrained_model_name_or_path="/workspace/data/model/
                         ponyDiffusionV6XL_v6StartWithThisOne.safetensors"      
                         --save_every_n_epochs="1" --save_every_n_steps="50"    
                         --save_model_as=safetensors --save_precision="bf16"    
                         --seed="1234" --text_encoder_lr=1                      
                         --train_batch_size="3"                                 
                         --train_data_dir="/workspace/data/image" --unet_lr=1   
                         --xformers                                             
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/queueing.py", line 527, in process_events
    response = await route_utils.call_process_api(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/route_utils.py", line 261, in call_process_api
    output = await app.get_blocks().process_api(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1786, in process_api
    result = await self.call_function(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1338, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/utils.py", line 759, in wrapper
    response = f(*args, **kwargs)
  File "/workspace/kohya_ss/kohya_gui/lora_gui.py", line 1072, in train_model
    env["TF_ENABLE_ONEDNN_OPTS"] = "0"
  File "/workspace/kohya_ss/kohya_gui/class_command_executor.py", line 31, in execute_command
    if self.process and self.process.poll() is None:
  File "/usr/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'accelerate launch --mixed_precision="bf16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=16 "/workspace/kohya_ss/sd-scripts/sdxl_train_network.py"  --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --clip_skip=2 --debiased_estimation_loss --enable_bucket --min_bucket_reso=320 --max_bucket_reso=4096 --gradient_checkpointing --huber_c="0.1" --huber_schedule="snr" --keep_tokens="1" --learning_rate="1.0" --logging_dir="/workspace/data/loggine" --loss_type="l2" --lr_scheduler="cosine_with_restarts" --lr_scheduler_num_cycles="3" --lr_warmup_steps="142" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="1024,1024" --max_token_length=225 --max_train_steps="2847" --min_timestep=0 --mixed_precision="bf16" --network_alpha="32" --network_args preset="full" conv_dim="64" conv_alpha="32" rank_dropout="0" bypass_mode="False" dora_wd="False" module_dropout="0" use_tucker="False" use_scalar="False" rank_dropout_scale="False" algo="locon" train_norm="False" --network_dim=64 --network_module=lycoris.kohya --no_half_vae --optimizer_args decouple=True weight_decay=0.01 betas=[0.9,0.99] d_coef=2 use_bias_correction=True safeguard_warmup=True --optimizer_type="Prodigy" --output_dir="/workspace/data/output" --output_name="test" --pretrained_model_name_or_path="/workspace/data/model/ponyDiffusionV6XL_v6StartWithThisOne.safetensors" --save_every_n_epochs="1" --save_every_n_steps="50" --save_model_as=safetensors --save_precision="bf16" --seed="1234" --text_encoder_lr=1 --train_batch_size="3" --train_data_dir="/workspace/data/image" --unet_lr=1 --xformers'

Sorry for the wall of text. This is what it looks like.

bmaltais commented 7 months ago

OK, I don't think you are using the latest code. Make sure to run git pull again to get the latest from that branch.

FurkanGozukara commented 7 months ago

Same error as on Kaggle.

FurkanGozukara commented 7 months ago

@bmaltais should we test this branch? Are there any new parameters?

https://github.com/bmaltais/kohya_ss/tree/fix_accelerate_issue

bmaltais commented 7 months ago

I just pushed a bunch of updates to the branch, so do another git pull. If it still fails, paste a copy of the command and the error so I can review it. It is not perfect yet, and I still have to discover all the bugs with the shlex.quote treatment I need to apply to all user-provided string values. It is very painful, but it is there to ensure no one can inject commands in a string to execute on a user's system, so this is important to address.

Thank you for testing this with me.
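
To make the injection concern concrete, here is a small sketch. The hostile value and the helper name echo_argument are invented for illustration. It shows that shlex.quote turns a string full of shell metacharacters into a single shell word, and that a child process started from an argument list receives the value literally, so nothing inside it is executed.

```python
import shlex
import subprocess
import sys

# Invented example of a value a user might type into a text field.
hostile = 'model"; echo INJECTED; "'

# shlex.quote wraps the value so a POSIX shell would see one word.
as_shell_word = shlex.quote(hostile)


def echo_argument(value):
    """Start a child from an argument list and return what it received."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\r\n")


print(as_shell_word)
print(echo_argument(hostile))  # the child sees the raw text, nothing runs
```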

bmaltais commented 7 months ago

OK... things are getting much better with the code now... let me know if you encounter any issues with the Dreambooth, Finetune or LoRA... I have not touched the utilities and tools yet...

bmaltais commented 7 months ago

Wow, this is turning into a major rewrite. I think I have nailed most of the Dreambooth, LoRA, TI and Finetuning tabs, and those should work relatively OK. I can't test every possible thing, so if you run into errors with those, let me know, but it should work fine on both Linux and Windows machines.

ashleykleynhans commented 7 months ago

I have reverted my Kohya_ss RunPod template to v23.1.3 and am doing a new build of the Docker image to revert the Ultimate RunPod template to v23.1.3 while keeping A1111 1.9.0.

ashleykleynhans commented 7 months ago

RunPod Ultimate template 5.0.1 has been reverted to v23.1.3 as well. I'll do a new build of both RunPod templates once this issue is fully resolved.

bmaltais commented 7 months ago

Sounds good. I think I have pretty much converted everything to work on Linux and Windows without shell=True, which required a lot of refactoring. If you have a chance to test on RunPod, let me know. Given the huge amount of rework, I am sure some bugs have crept in, but if enough people test it, most should be rooted out quickly.
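
As a rough idea of what an executor without shell=True can look like, here is a hypothetical sketch. It is not the project's actual class_command_executor code: the class layout, is_running and kill_command are assumptions made for this example. Only execute_command, self.process and the poll() check echo names visible in the traceback earlier in the thread. The key design choice is to accept only a list of arguments, so a flattened command string is rejected up front instead of failing inside subprocess.

```python
import subprocess
import sys


class CommandExecutor:
    """Hypothetical stand-in for a GUI command runner (illustration only)."""

    def __init__(self):
        self.process = None

    def is_running(self):
        # poll() returns None while the child is still alive.
        return self.process is not None and self.process.poll() is None

    def execute_command(self, run_cmd, **kwargs):
        """Start run_cmd (a list of arguments) unless a job is already active."""
        if not isinstance(run_cmd, (list, tuple)):
            raise TypeError("run_cmd must be a list of arguments, not a string")
        if self.is_running():
            return False
        # No shell=True: each list item reaches the child as one argument.
        self.process = subprocess.Popen(list(run_cmd), **kwargs)
        return True

    def kill_command(self):
        """Stop the active job, if any."""
        if self.is_running():
            self.process.terminate()
            self.process.wait()
        self.process = None


executor = CommandExecutor()
started = executor.execute_command([sys.executable, "-c", "print('training...')"])
executor.process.wait()
```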

bmaltais commented 7 months ago

The fixes are now in the dev branch.

nyagano commented 7 months ago

I just tested it and was able to train. Thanks for all your efforts.

bmaltais commented 7 months ago

Great! Glad it is fixed.

bmaltais commented 7 months ago

dev should work now...