FurkanGozukara closed this issue 6 months ago
It is working fine on my local machine... I had to remove the shell=True because it was reported as a security vulnerability... but perhaps this is causing the run issue?
Very probably. How can we enable it? This is on Kaggle.
This is the thing: there is an official CVE advisory for the GUI and I had to fix it... but how can this be fixed without causing this issue with Kaggle? Perhaps I can add another flag that sets shell=True when a --run-insecure-shell flag is passed to the GUI?
--run-insecure-shell would work totally.
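A possible shape for that opt-in (a sketch only — the flag name comes from the comment above, but the wiring and the use_shell parameter are assumptions, not the actual implementation):

```python
import argparse
import subprocess

# Hypothetical sketch: keep shell=False as the secure default, and only
# enable shell=True when the user explicitly opts in at launch time.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--run-insecure-shell",
    action="store_true",
    help="Run training commands through the shell "
         "(needed on some hosts like Kaggle, but less secure).",
)

def execute_command(run_cmd, use_shell, **kwargs):
    # shell=True is only ever passed when --run-insecure-shell was given,
    # so the insecure path requires a deliberate choice by the user.
    return subprocess.Popen(run_cmd, shell=use_shell, **kwargs)
```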
I have created a branch to build the feature... I will post back when it can be tested... What do you train at the moment? Dreambooth? LoRA? I will implement it for that first.
OK, give it a test:
git fetch origin
git checkout 2271-accelerate-command-missing-on-kaggle
Sadly that didn't fix it.
The command shows a ' at the beginning and a ' at the end. Could that be the reason?
Already on '2271-accelerate-command-missing-on-kaggle'
Your branch is up to date with 'origin/2271-accelerate-command-missing-on-kaggle'.
venv folder does not exist. Not activating...
02:11:35-118993 INFO Kohya_ss GUI version: v23.1.6
02:11:35-220138 INFO Submodule initialized and updated.
02:11:35-222165 INFO nVidia toolkit detected
02:11:37-666674 INFO Torch 2.1.2+cu118
02:11:37-721434 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8700
02:11:37-749660 INFO Torch detected GPU: Tesla T4 VRAM 15102 Arch (7, 5)
Cores 40
02:11:37-751953 INFO Torch detected GPU: Tesla T4 VRAM 15102 Arch (7, 5)
Cores 40
02:11:37-757186 INFO Python version is 3.10.13 | packaged by conda-forge |
(main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
02:11:37-759623 INFO Verifying modules installation status from
/kaggle/working/kohya_ss/requirements_linux.txt...
02:11:37-763891 INFO Verifying modules installation status from
requirements.txt...
02:11:48-882172 INFO headless: False
Running on local URL: http://127.0.0.1:7860/
To create a public link, set `share=True` in `launch()`.
02:12:31-217566 INFO Loading config...
/opt/conda/lib/python3.10/site-packages/gradio/components/dropdown.py:173: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: or set allow_custom_value=True.
warnings.warn(
02:12:31-730220 INFO SDXL model selected. Setting sdxl parameters
02:12:40-885073 INFO Start training Dreambooth...
02:12:40-886942 INFO Validating model file or folder path
stabilityai/stable-diffusion-xl-base-1.0 existence...
02:12:40-889022 INFO ...valid
02:12:40-890312 INFO Validating output_dir path
/kaggle/working/results/model existence...
02:12:40-891943 INFO ...valid
02:12:40-893196 INFO Validating train_data_dir path
/kaggle/working/results/img existence...
02:12:40-895042 INFO ...valid
02:12:40-896311 INFO Validating reg_data_dir path
/kaggle/working/results/reg existence...
02:12:40-897943 INFO ...valid
02:12:40-899179 INFO Validating logging_dir path /kaggle/working/results/log
existence...
02:12:40-900749 INFO ...valid
02:12:40-901921 INFO log_tracker_config not specified, skipping validation
02:12:40-903353 INFO resume not specified, skipping validation
02:12:40-904707 INFO Checking vae... huggingface.co model, skipping
validation
02:12:40-906307 INFO dataset_config not specified, skipping validation
02:12:40-907776 INFO Folder 100_ohwx man : steps 1500
02:12:40-909186 INFO Regularisation images are used... Will double the
number of steps required...
02:12:40-910894 INFO max_train_steps (1500 / 1 / 1 * 1 * 2) = 3000
02:12:40-912591 INFO stop_text_encoder_training = 0
02:12:40-913974 INFO lr_warmup_steps = 300
02:12:40-915436 INFO Can't use LR warmup with LR Scheduler constant...
ignoring...
02:12:40-917664 WARNING Here is the trainer command as a reference. It will not
be executed:
accelerate launch --mixed_precision="fp16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=4 "/kaggle/working/kohya_ss/sd-scripts/sdxl_train.py" --max_grad_norm=0.0 --no_half_vae --train_text_encoder --ddp_timeout=10000000 --ddp_gradient_as_bucket_view --learning_rate_te2="0" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --full_fp16 --gradient_checkpointing --huber_c="0.1" --huber_schedule="snr" --learning_rate="1e-05" --learning_rate_te1="3e-06" --learning_rate_te2="0" --logging_dir="/kaggle/working/results/log" --loss_type="l2" --lr_scheduler="constant" --lr_scheduler_num_cycles="1" --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3000" --mem_eff_attn --min_timestep=0 --mixed_precision="fp16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01 --optimizer_type="Adafactor" --output_dir="/kaggle/working/results/model" --output_name="My_DB_Kaggle" --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --reg_data_dir="/kaggle/working/results/reg" --save_every_n_epochs="1" --save_every_n_steps="1300" --save_model_as=safetensors --save_precision="fp16" --train_batch_size="1" --train_data_dir="/kaggle/working/results/img" --xformers
02:12:48-732112 INFO Start training Dreambooth...
02:12:48-733761 INFO Validating model file or folder path
stabilityai/stable-diffusion-xl-base-1.0 existence...
02:12:48-735586 INFO ...valid
02:12:48-736905 INFO Validating output_dir path
/kaggle/working/results/model existence...
02:12:48-738537 INFO ...valid
02:12:48-740062 INFO Validating train_data_dir path
/kaggle/working/results/img existence...
02:12:48-741963 INFO ...valid
02:12:48-743220 INFO Validating reg_data_dir path
/kaggle/working/results/reg existence...
02:12:48-744655 INFO ...valid
02:12:48-746097 INFO Validating logging_dir path /kaggle/working/results/log
existence...
02:12:48-747757 INFO ...valid
02:12:48-748985 INFO log_tracker_config not specified, skipping validation
02:12:48-750435 INFO resume not specified, skipping validation
02:12:48-751729 INFO Checking vae... huggingface.co model, skipping
validation
02:12:48-753137 INFO dataset_config not specified, skipping validation
02:12:48-754655 INFO Folder 100_ohwx man : steps 1500
02:12:48-756047 INFO Regularisation images are used... Will double the
number of steps required...
02:12:48-757669 INFO max_train_steps (1500 / 1 / 1 * 1 * 2) = 3000
02:12:48-759201 INFO stop_text_encoder_training = 0
02:12:48-760531 INFO lr_warmup_steps = 300
02:12:48-761826 INFO Can't use LR warmup with LR Scheduler constant...
ignoring...
02:12:48-763608 INFO Saving training config to
/kaggle/working/results/model/My_DB_Kaggle_20240413-021
248.json...
02:12:48-765815 INFO accelerate launch --mixed_precision="fp16"
--num_processes=1 --num_machines=1
--num_cpu_threads_per_process=4
"/kaggle/working/kohya_ss/sd-scripts/sdxl_train.py"
--max_grad_norm=0.0 --no_half_vae --train_text_encoder
--ddp_timeout=10000000 --ddp_gradient_as_bucket_view
--learning_rate_te2="0" --bucket_no_upscale
--bucket_reso_steps=64 --cache_latents
--cache_latents_to_disk --full_fp16
--gradient_checkpointing --huber_c="0.1"
--huber_schedule="snr" --learning_rate="1e-05"
--learning_rate_te1="3e-06" --learning_rate_te2="0"
--logging_dir="/kaggle/working/results/log"
--loss_type="l2" --lr_scheduler="constant"
--lr_scheduler_num_cycles="1"
--max_data_loader_n_workers="0"
--resolution="1024,1024" --max_train_steps="3000"
--mem_eff_attn --min_timestep=0
--mixed_precision="fp16" --optimizer_args
scale_parameter=False relative_step=False
warmup_init=False weight_decay=0.01
--optimizer_type="Adafactor"
--output_dir="/kaggle/working/results/model"
--output_name="My_DB_Kaggle"
--pretrained_model_name_or_path="stabilityai/stable-dif
fusion-xl-base-1.0"
--reg_data_dir="/kaggle/working/results/reg"
--save_every_n_epochs="1" --save_every_n_steps="1300"
--save_model_as=safetensors --save_precision="fp16"
--train_batch_size="1"
--train_data_dir="/kaggle/working/results/img"
--xformers
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/gradio/queueing.py", line 495, in call_prediction
output = await route_utils.call_process_api(
File "/opt/conda/lib/python3.10/site-packages/gradio/route_utils.py", line 235, in call_process_api
output = await app.get_blocks().process_api(
File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 1627, in process_api
result = await self.call_function(
File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 1173, in call_function
prediction = await anyio.to_thread.run_sync(
File "/opt/conda/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
return await future
File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
result = context.run(func, *args)
File "/opt/conda/lib/python3.10/site-packages/gradio/utils.py", line 690, in wrapper
response = f(*args, **kwargs)
File "/kaggle/working/kohya_ss/kohya_gui/dreambooth_gui.py", line 731, in train_model
executor.execute_command(run_cmd=run_cmd, env=env)
File "/kaggle/working/kohya_ss/kohya_gui/class_command_executor.py", line 32, in execute_command
self.process = subprocess.Popen(run_cmd, **kwargs)
File "/opt/conda/lib/python3.10/subprocess.py", line 971, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/opt/conda/lib/python3.10/subprocess.py", line 1863, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'accelerate launch --mixed_precision="fp16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=4 "/kaggle/working/kohya_ss/sd-scripts/sdxl_train.py" --max_grad_norm=0.0 --no_half_vae --train_text_encoder --ddp_timeout=10000000 --ddp_gradient_as_bucket_view --learning_rate_te2="0" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --full_fp16 --gradient_checkpointing --huber_c="0.1" --huber_schedule="snr" --learning_rate="1e-05" --learning_rate_te1="3e-06" --learning_rate_te2="0" --logging_dir="/kaggle/working/results/log" --loss_type="l2" --lr_scheduler="constant" --lr_scheduler_num_cycles="1" --max_data_loader_n_workers="0" --resolution="1024,1024" --max_train_steps="3000" --mem_eff_attn --min_timestep=0 --mixed_precision="fp16" --optimizer_args scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01 --optimizer_type="Adafactor" --output_dir="/kaggle/working/results/model" --output_name="My_DB_Kaggle" --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --reg_data_dir="/kaggle/working/results/reg" --save_every_n_epochs="1" --save_every_n_steps="1300" --save_model_as=safetensors --save_precision="fp16" --train_batch_size="1" --train_data_dir="/kaggle/working/results/img" --xformers'
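For context on the traceback: with shell=False (the Popen default), passing the whole command line as one string makes Python look for a single executable whose literal name is the entire string, which is why it raises FileNotFoundError before anything runs. A minimal illustration (not the GUI's code):

```python
import subprocess

# With shell=False (the default), a single string is treated as the
# literal program name, not as a shell command line to be parsed.
try:
    subprocess.Popen("accelerate launch --help")
except FileNotFoundError:
    # No file named 'accelerate launch --help' (spaces included) exists,
    # so Popen fails exactly like the traceback above.
    print("FileNotFoundError: whole string treated as one executable")
```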
@bmaltais
on Kaggle it fails vs works locally
Maybe I need to add headless, gonna try and let you know.
Sorry for this.
Strange... so perhaps it is not related to the update then... Did something change on Kaggle?
I have this same issue on Ubuntu 22.04 training LoRA after the update.
On a local Ubuntu or on a cloud service?
Local
Really weird… and does running accelerate launch in the venv work?
Running accelerate launch in the venv does work, showing me the options.
So this must be the shell=True that was removed... you could try to add it manually in class_command_executor to see if that fixes the issue...
on line 32:
self.process = subprocess.Popen(run_cmd, shell=True, **kwargs)
This did correct things for me. Thanks!
I had the same problem on Runpod, but this solved it.
Ok, so at least now we know the root cause… but I can’t revert to how things were before because of the security risk it poses according to the security report I received… so I will need to find a different solution than just reinstating the shell=True parameter…
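One way such a different solution could look (a hedged sketch, not necessarily what the repo ended up doing): split the command string into an argument list and keep shell=False, e.g. with shlex.

```python
import shlex
import subprocess

# Example command string in the same style as the GUI's run_cmd.
run_cmd = 'accelerate launch --mixed_precision="fp16" --num_processes=1'

# shlex.split honours the shell-style quoting in the string and yields
# an argv list that Popen can execute directly, without invoking a shell.
args = shlex.split(run_cmd)
print(args)
# ['accelerate', 'launch', '--mixed_precision=fp16', '--num_processes=1']
# subprocess.Popen(args) would then start accelerate without shell=True.
```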
I am glad you found the error. When can we expect a solution? I am fine with insecure access :D
Things should work in the dev branch now. Major code rewrite, but for the best. It was long overdue.
@bmaltais it works. When can you merge it in?
Also, when doing multi-GPU training it runs 2 epochs and saves 2 checkpoints even though I train for 1 epoch.
For the multi-GPU issue, this is really something that should be sorted out with kohya.
I will need to do a few more things and bug fixes to dev while users report issues with it. I would rather we take some time to test it properly before making another release with multiple issues.
I need to see if I can fix the Start/Stop training button issue where it will not go back to Start once the training is completed… it is annoying… If I can't fix it I will probably go back to two separate buttons.
The code was working previously, but since you made some system changes @bmaltais it is not working anymore.
The issue should be easy to fix.
When I click Start training, it tries to run like this,
but it can't find the command, as below.
When I copy-paste it into a new cell and run it as below, it works perfectly fine. Can you fix this urgently if possible?