Closed nyagano closed 7 months ago
It also happens to me. https://pastebin.com/j9JzV9a2
I wasted a ticket because of this error.
On line 32:
self.process = subprocess.Popen(run_cmd, shell=True, **kwargs)
According to another issue, this change seems to fix it. I haven't tried it, though.
Oops, I forgot: it's in class_command_executor.
Sorry about that. I had to remove shell=True because it is insecure and can be exploited, but removing it causes issues on Linux, while Windows is fine, so I need to find a solution. I think I have found one but will need your help to test it. I am currently implementing a "hack" in a new branch until I can comb over the code and re-implement things in a better, more secure way. I put the security of users first, so you may need to use the previous "insecure" release for a bit. I hope to have a quick fix out today.
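For anyone following along, here is a minimal sketch of the general idea of launching a command without shell=True: split the command string into an argument list first. This is not the code from the branch. run_without_shell is a hypothetical helper, and the tiny Python one-liner stands in for the real accelerate launch command.

```python
import shlex
import subprocess
import sys


def run_without_shell(command: str) -> str:
    """Run a command string without a shell and return its stdout."""
    # shlex.split applies POSIX shell quoting rules, so a quoted value
    # stays together as a single argument.
    args = shlex.split(command)
    # Passing a list (and leaving shell=False) means no shell ever
    # interprets the string, so characters like ';' are not special.
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout.strip()


# Stand-in for the real training command: call the current interpreter.
demo_command = f'{shlex.quote(sys.executable)} -c "print(1 + 1)"'
print(run_without_shell(demo_command))
```

The same list of arguments can be handed to subprocess.Popen on both Linux and Windows, which is why an argument list is usually the portable choice.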
I also learned that the Gradio release I had to upgrade to, also because of security issues in the previous Gradio release, is causing an issue on RunPod. I will bake the fix for that into the next release as well.
I have pushed a quick PoC of the fix to the fix_accelerate_issue branch, only for Dreambooth at the moment. If you can confirm that it works on Linux, I would appreciate it:
git pull
git checkout fix_accelerate_issue
OK, I have added the fix for the other trainers, but it is NOT perfect: it will cause issues for inputs like training comments. The shlex validator mangles the parameter and causes the command to fail to execute, so I think I will need to invest far more time to fix this. This is turning into a nightmare.
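One way a quoted value can end up mangled, sketched under the assumption that the quoted string is placed directly into an argument list rather than parsed by a shell: the quotes added by shlex.quote then become part of the value. echo_argument is a hypothetical helper that prints back the argument it receives, and the comment text is made up. Whether this is the exact failure in the GUI is not confirmed here.

```python
import shlex
import subprocess
import sys


def echo_argument(value: str) -> str:
    """Pass one argument to a child process and return what it received."""
    child_code = "import sys; print(sys.argv[1])"
    completed = subprocess.run(
        [sys.executable, "-c", child_code, value],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout.rstrip("\n")


comment = "my first run"        # hypothetical training comment
quoted = shlex.quote(comment)   # "'my first run'"

print(echo_argument(comment))   # the value arrives unchanged
print(echo_argument(quoted))    # the literal single quotes arrive too
```

In other words, quoting is only needed when a shell will parse the string. With an argument list, the raw value can usually be passed as-is.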
Don't worry, it happens. I'm really sorry, but I can't test it now as I'm low on RunPod credits. Thanks for the effort in trying to fix it.
Thanks for everything. Does this mean it will not work in Lora?
It should work for LoRA, at least in theory. Is this using the latest pull of the fix_accelerate_issue branch?
It looks like the error is different from the one on the original master branch, so perhaps something else is causing it?
Yes. I did this after running these:
git pull
git checkout fix_accelerate_issue
Can you paste the logged command and the full traceback that follows it, as text?
FileNotFoundError: [Errno 2] No such file or directory: 'accelerate launch --mixed_precision="bf16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=16 "/workspace/kohya_ss/sd-scripts/sdxl_train_network.py" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --clip_skip=2 --debiased_estimation_loss --enable_bucket --min_bucket_reso=320 --max_bucket_reso=4096 --gradient_checkpointing --huber_c="0.1" --huber_schedule="snr" --keep_tokens="1" --learning_rate="1.0" --logging_dir="/workspace/data/loggine" --loss_type="l2" --lr_scheduler="cosine_with_restarts" --lr_scheduler_num_cycles="3" --lr_warmup_steps="142" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="1024,1024" --max_token_length=225 --max_train_steps="2847" --min_timestep=0 --mixed_precision="bf16" --network_alpha="32" --network_args preset="full" conv_dim="64" conv_alpha="32" rank_dropout="0" bypass_mode="False" dora_wd="False" module_dropout="0" use_tucker="False" use_scalar="False" rank_dropout_scale="False" algo="locon" train_norm="False" --network_dim=64 --network_module=lycoris.kohya --no_half_vae --optimizer_args decouple=True weight_decay=0.01 betas=[0.9,0.99] d_coef=2 use_bias_correction=True safeguard_warmup=True --optimizer_type="Prodigy" --output_dir="/workspace/data/output" --output_name="test" --pretrained_model_name_or_path="/workspace/data/model/ponyDiffusionV6XL_v6StartWithThisOne.safetensors" --save_every_n_epochs="1" --save_every_n_steps="50" --save_model_as=safetensors --save_precision="bf16" --seed="1234" --text_encoder_lr=1 --train_batch_size="3" --train_data_dir="/workspace/data/image" --unet_lr=1 --xformers'
14:18:23-790314 INFO Start training LoRA LyCORIS/LoCon ...
14:18:23-792530 INFO Validating model file or folder path
/workspace/data/model/ponyDiffusionV6XL_v6StartWithThis
One.safetensors existence...
14:18:23-794914 INFO ...valid
14:18:23-796619 INFO Validating output_dir path /workspace/data/output
existence...
14:18:23-798669 INFO ...valid
14:18:23-800341 INFO Validating train_data_dir path /workspace/data/image
existence...
14:18:23-802392 INFO ...valid
14:18:23-803730 INFO reg_data_dir not specified, skipping validation
14:18:23-804475 INFO Validating logging_dir path /workspace/data/loggine
existence...
14:18:23-805325 INFO ...valid
14:18:23-806030 INFO log_tracker_config not specified, skipping validation
14:18:23-806798 INFO resume not specified, skipping validation
14:18:23-807533 INFO vae not specified, skipping validation
14:18:23-808257 INFO lora_network_weights not specified, skipping validation
14:18:23-809005 INFO dataset_config not specified, skipping validation
14:18:23-809751 INFO Headless mode, skipping verification if model already
exist... if model already exist it will be
overwritten...
14:18:23-810921 INFO Folder 7_outfit122: 122 images found
14:18:23-811698 INFO Folder 7_outfit122: 854 steps
14:18:23-812462 INFO Error: '.ipynb_checkpoints' does not contain an
underscore, skipping...
14:18:23-813330 INFO Total steps: 854
14:18:23-814058 INFO Train batch size: 3
14:18:23-814814 INFO Gradient accumulation steps: 1
14:18:23-815565 INFO Epoch: 10
14:18:23-816274 INFO Regulatization factor: 1
14:18:23-817016 INFO max_train_steps (854 / 3 / 1 * 10 * 1) = 2847
14:18:23-817938 INFO stop_text_encoder_training = 0
14:18:23-818714 INFO lr_warmup_steps = 142
14:18:23-819571 INFO Saving training config to
/workspace/data/output/test_20240413-141823.json...
14:18:23-820754 INFO accelerate launch --mixed_precision="bf16"
--num_processes=1 --num_machines=1
--num_cpu_threads_per_process=16
"/workspace/kohya_ss/sd-scripts/sdxl_train_network.py"
--bucket_no_upscale --bucket_reso_steps=64
--cache_latents --cache_latents_to_disk
--caption_extension=".txt" --clip_skip=2
--debiased_estimation_loss --enable_bucket
--min_bucket_reso=320 --max_bucket_reso=4096
--gradient_checkpointing --huber_c="0.1"
--huber_schedule="snr" --keep_tokens="1"
--learning_rate="1.0"
--logging_dir="/workspace/data/loggine"
--loss_type="l2" --lr_scheduler="cosine_with_restarts"
--lr_scheduler_num_cycles="3" --lr_warmup_steps="142"
--max_data_loader_n_workers="0" --max_grad_norm="1"
--resolution="1024,1024" --max_token_length=225
--max_train_steps="2847" --min_timestep=0
--mixed_precision="bf16" --network_alpha="32"
--network_args preset="full" conv_dim="64"
conv_alpha="32" rank_dropout="0" bypass_mode="False"
dora_wd="False" module_dropout="0" use_tucker="False"
use_scalar="False" rank_dropout_scale="False"
algo="locon" train_norm="False" --network_dim=64
--network_module=lycoris.kohya --no_half_vae
--optimizer_args decouple=True weight_decay=0.01
betas=[0.9,0.99] d_coef=2 use_bias_correction=True
safeguard_warmup=True --optimizer_type="Prodigy"
--output_dir="/workspace/data/output"
--output_name="test"
--pretrained_model_name_or_path="/workspace/data/model/
ponyDiffusionV6XL_v6StartWithThisOne.safetensors"
--save_every_n_epochs="1" --save_every_n_steps="50"
--save_model_as=safetensors --save_precision="bf16"
--seed="1234" --text_encoder_lr=1
--train_batch_size="3"
--train_data_dir="/workspace/data/image" --unet_lr=1
--xformers
Traceback (most recent call last):
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/queueing.py", line 527, in process_events
response = await route_utils.call_process_api(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/route_utils.py", line 261, in call_process_api
output = await app.get_blocks().process_api(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1786, in process_api
result = await self.call_function(
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1338, in call_function
prediction = await anyio.to_thread.run_sync(
File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 851, in run
result = context.run(func, *args)
File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/utils.py", line 759, in wrapper
response = f(*args, **kwargs)
File "/workspace/kohya_ss/kohya_gui/lora_gui.py", line 1072, in train_model
env["TF_ENABLE_ONEDNN_OPTS"] = "0"
File "/workspace/kohya_ss/kohya_gui/class_command_executor.py", line 31, in execute_command
if self.process and self.process.poll() is None:
File "/usr/lib/python3.10/subprocess.py", line 971, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.10/subprocess.py", line 1863, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'accelerate launch --mixed_precision="bf16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=16 "/workspace/kohya_ss/sd-scripts/sdxl_train_network.py" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --clip_skip=2 --debiased_estimation_loss --enable_bucket --min_bucket_reso=320 --max_bucket_reso=4096 --gradient_checkpointing --huber_c="0.1" --huber_schedule="snr" --keep_tokens="1" --learning_rate="1.0" --logging_dir="/workspace/data/loggine" --loss_type="l2" --lr_scheduler="cosine_with_restarts" --lr_scheduler_num_cycles="3" --lr_warmup_steps="142" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="1024,1024" --max_token_length=225 --max_train_steps="2847" --min_timestep=0 --mixed_precision="bf16" --network_alpha="32" --network_args preset="full" conv_dim="64" conv_alpha="32" rank_dropout="0" bypass_mode="False" dora_wd="False" module_dropout="0" use_tucker="False" use_scalar="False" rank_dropout_scale="False" algo="locon" train_norm="False" --network_dim=64 --network_module=lycoris.kohya --no_half_vae --optimizer_args decouple=True weight_decay=0.01 betas=[0.9,0.99] d_coef=2 use_bias_correction=True safeguard_warmup=True --optimizer_type="Prodigy" --output_dir="/workspace/data/output" --output_name="test" --pretrained_model_name_or_path="/workspace/data/model/ponyDiffusionV6XL_v6StartWithThisOne.safetensors" --save_every_n_epochs="1" --save_every_n_steps="50" --save_model_as=safetensors --save_precision="bf16" --seed="1234" --text_encoder_lr=1 --train_batch_size="3" --train_data_dir="/workspace/data/image" --unet_lr=1 --xformers'
Sorry for the text. It will look something like this.
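The error message above quotes the entire command line as the missing file. On POSIX systems, subprocess.Popen given a single string with shell=False treats that whole string as the program path, which is consistent with this traceback. Below is a small sketch that reproduces the behaviour. is_treated_as_one_path is a hypothetical helper, and on Windows a string is handled differently, so the result may differ there.

```python
import subprocess
import sys


def is_treated_as_one_path(command: str) -> bool:
    """Return True if Popen fails because the whole string is taken as a file name."""
    try:
        process = subprocess.Popen(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except FileNotFoundError:
        return True
    process.wait()
    return False


# Program plus one flag, passed as a single string without shell=True.
print(is_treated_as_one_path(f"{sys.executable} --version"))
```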
OK, I don't think you are using the latest code. Make sure to run git pull again to get the latest from that branch.
same error as kaggle
@bmaltais should we test this branch? Any new parameters?
https://github.com/bmaltais/kohya_ss/tree/fix_accelerate_issue
I just pushed a bunch of updates to the branch, so do another git pull. If it still fails, paste a copy of the command and the error so I can review them. It is not perfect yet, and I still have to discover all the bugs with the shlex.quote treatment I need to apply to all user-provided string values. Very painful, but it is there to ensure no one can inject commands into a string that would execute on a user's system, so this is important to address.
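As a rough illustration of what shlex.quote is for (not the code in the branch): when a command string will be parsed by a shell, quoting a user-provided value keeps shell metacharacters inside a single argument. The output_name value below is a made-up hostile input, train is a placeholder program name, and shlex.split is used only to mimic POSIX word splitting without executing anything.

```python
import shlex

# Hypothetical user input that tries to smuggle in a second command.
output_name = "test; echo injected"

unquoted = f"train --output_name={output_name}"
quoted = f"train --output_name={shlex.quote(output_name)}"

# Without quoting, the value spills over into extra words.
unquoted_words = shlex.split(unquoted)
# With quoting, the whole value stays inside one argument.
quoted_words = shlex.split(quoted)

print(unquoted_words)
print(quoted_words)
```

A real shell would go further than the word splitting shown here and treat the unquoted ';' as a command separator, which is the injection risk being closed.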
Thank you for testing this with me.
OK, things are getting much better with the code now. Let me know if you encounter any issues with Dreambooth, Finetune or LoRA. I have not touched the utilities and tools yet.
Wow, this is turning into a major rewrite. I think I have nailed most of the Dreambooth, LoRA, TI and Finetuning tabs, so those should work relatively OK. I can't test every possible thing, so if you run into errors with those, let me know, but it should work fine on both Linux and Windows machines.
I have reverted my Kohya_ss RunPod template to v23.1.3 and am doing a new build of the Docker image to revert the Ultimate RunPod template to v23.1.3 but keep A1111 1.9.0.
RunPod Ultimate template 5.0.1 has been reverted to v23.1.3 as well. I'll do a new build of both RunPod templates once this issue is fully resolved.
Sounds good. I think I have pretty much converted everything to work on Linux and Windows without shell=True, which required a lot of refactoring. If you have a chance to test on RunPod, let me know. Given the huge amount of rework, I am sure some bugs have crept in, but if enough people test it, most should be rooted out quickly.
The fixes are now in the dev branch.
I just tested it and was able to train. Thanks for all your efforts.
Great! Glad it is fixed.
The dev branch should work now.