bmaltais / kohya_ss

Apache License 2.0
9.41k stars 1.22k forks

After 1 epoch, the training is interrupted and cannot load the pre-trained data #2856

Open Lecho303 opened 3 days ago

Lecho303 commented 3 days ago

I set 3 epochs, but when the first epoch finished, the sample image was not created and I got this error:

generating sample images at step / サンプル画像生成 ステップ: 2100 train_util.py:5569
2024-09-27 01:42:06 INFO prompt: masterpiece,1big bowl, chopped green onion,Chicken soup, waste paper placed in the bottom of the bowl to dissipate heat,warmlight,Chinese-food-style train_util.py:5722
2024-09-27 01:42:07 INFO negative_prompt: low quality, worst quality, bad anatomy, bad composition, poor, low effort train_util.py:5723
INFO height: 1200 train_util.py:5724
INFO width: 1024 train_util.py:5725
INFO sample_steps: 20 train_util.py:5726
INFO scale: 7.5 train_util.py:5727
INFO sample_sampler: euler_a train_util.py:5728
INFO seed: 1 train_util.py:5730
C:\Users\ningl\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Traceback (most recent call last):
  File "C:\Users\ningl\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in <module>
    trainer.train(args)
  File "C:\Users\ningl\kohya_ss\sd-scripts\train_network.py", line 1085, in train
    self.sample_images(accelerator, args, epoch + 1, global_step, accelerator.device, vae, tokenizer, text_encoder, unet)
  File "C:\Users\ningl\kohya_ss\sd-scripts\sdxl_train_network.py", line 168, in sample_images
    sdxl_train_util.sample_images(accelerator, args, epoch, global_step, device, vae, tokenizer, text_encoder, unet)
  File "C:\Users\ningl\kohya_ss\sd-scripts\library\sdxl_train_util.py", line 381, in sample_images
    return train_util.sample_images_common(SdxlStableDiffusionLongPromptWeightingPipeline, *args, **kwargs)
  File "C:\Users\ningl\kohya_ss\sd-scripts\library\train_util.py", line 5644, in sample_images_common
    sample_image_inference(
  File "C:\Users\ningl\kohya_ss\sd-scripts\library\train_util.py", line 5732, in sample_image_inference
    latents = pipeline(
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\ningl\kohya_ss\sd-scripts\library\sdxl_lpw_stable_diffusion.py", line 1012, in __call__
    noise_pred = self.unet(latent_model_input, t, text_embedding, vector_embedding)
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\accelerate\utils\operations.py", line 680, in forward
    return model_forward(*args, **kwargs)
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\accelerate\utils\operations.py", line 668, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "C:\Users\ningl\kohya_ss\sd-scripts\library\sdxl_original_unet.py", line 1110, in forward
    h = torch.cat([h, hs.pop()], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 76 but got size 75 for tensor number 1 in the list.
steps: 25%|▎| 2100/8400 [33:10:44<99:32:13, 56.88s/it, Average key norm=tensor(2.4855, device='cuda:0'), Keys Scaled=t
Traceback (most recent call last):
  File "C:\Users\ningl\miniconda3\envs\kohyass\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\ningl\miniconda3\envs\kohyass\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\ningl\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\ningl\kohya_ss\venv\Scripts\python.exe', 'C:/Users/ningl/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', 'C:/Users/ningl/Desktop/2new/model/config_lora-20240925-163127.toml']' returned non-zero exit status 1.
01:42:38-420432 INFO Training has ended.

I then tried to resume from the last saved state. I found the epoch-1 state folder and started training again, but got this error:

Traceback (most recent call last):
  File "C:\Users\ningl\kohya_ss\sd-scripts\sdxl_train_network.py", line 185, in <module>
    trainer.train(args)
  File "C:\Users\ningl\kohya_ss\sd-scripts\train_network.py", line 539, in train
    train_util.resume_from_local_or_hf_if_specified(accelerator, args)
  File "C:\Users\ningl\kohya_ss\sd-scripts\library\train_util.py", line 4209, in resume_from_local_or_hf_if_specified
    accelerator.load_state(args.resume)
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 2861, in load_state
    load_accelerator_state(
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\accelerate\checkpointing.py", line 204, in load_accelerator_state
    state_dict = torch.load(input_model_file, map_location=map_location)
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\torch\serialization.py", line 986, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\torch\serialization.py", line 435, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\torch\serialization.py", line 416, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\ningl\Desktop\2new\model\Chinese-food-style-000001-state\pytorch_model.bin'
Traceback (most recent call last):
  File "C:\Users\ningl\miniconda3\envs\kohyass\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\ningl\miniconda3\envs\kohyass\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\ningl\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Users\ningl\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\ningl\kohya_ss\venv\Scripts\python.exe', 'C:/Users/ningl/kohya_ss/sd-scripts/sdxl_train_network.py', '--config_file', 'C:/Users/ningl/Desktop/2new/model/config_lora-20240927-105417.toml']' returned non-zero exit status 1.
10:54:45-511841 INFO Training has ended.

I am confused: what should I do? I cannot find "pytorch_model.bin" in this state folder; I can only find the optimizer and scheduler .bin files.

b-fission commented 3 days ago
INFO height: 1200 train_util.py:5724
INFO width: 1024 train_util.py:5725

Make sure the width and height values on your sample prompt are divisible by 32. 1200 won't work, but you can change it to 1216.
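To see why 1200 fails while 1216 works, here is a rough sketch of the size arithmetic. It rests on assumptions that are not stated in this thread: that the VAE shrinks each side by a factor of 8, and that the SDXL U-Net halves the latent twice with stride-2 convolutions (which round odd sizes up) before doubling it back and concatenating with the saved skip tensors. The helper names `round_up_to_multiple` and `skip_mismatch` are made up for illustration and are not part of sd-scripts.

```python
from typing import Optional, Tuple


def round_up_to_multiple(value: int, multiple: int = 32) -> int:
    """Round value up to the nearest multiple (default 32)."""
    return ((value + multiple - 1) // multiple) * multiple


def skip_mismatch(
    pixels: int, vae_factor: int = 8, downsamples: int = 2
) -> Optional[Tuple[int, int]]:
    """Simplified model of one spatial dimension through the U-Net.

    Returns (upsampled_size, skip_size) at the first mismatch,
    or None if every upsampled size matches its skip connection.
    vae_factor=8 and downsamples=2 are assumptions about SDXL.
    """
    size = pixels // vae_factor  # latent size fed to the U-Net
    skips = []
    for _ in range(downsamples):
        skips.append(size)      # size remembered for the skip connection
        size = (size + 1) // 2  # stride-2 conv, kernel 3, padding 1: rounds up
    for skip in reversed(skips):
        size *= 2               # upsampling exactly doubles the size
        if size != skip:
            return size, skip
    return None


if __name__ == "__main__":
    for height in (1200, 1216, 1024):
        print(height, round_up_to_multiple(height), skip_mismatch(height))
```

Under these assumptions, a height of 1200 gives a latent of 150, which halves to 75 and then 38; doubling 38 gives 76, which cannot be concatenated with the saved 75. That is consistent with the "Expected size 76 but got size 75" message in the traceback above. Rounding up to 1216 gives 152, 76, 38, and every size matches on the way back up.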

Lecho303 commented 3 days ago
OK, thanks. I did not know that was a problem. But when I continue training, why does it tell me that it cannot load the epoch-1 state?