bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

I ran the "Training" section and got this error #144

Closed xbox002000 closed 9 months ago

xbox002000 commented 1 year ago

running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 1500
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 1500
  num epochs / epoch数: 1
  batch size per device / バッチサイズ: 1
  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 1500
steps: 0%| | 0/1500 [00:00<?, ?it/s]
epoch 1/1
E:\SD\Kohya\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Error no kernel image is available for execution on the device at line 167 in file D:\ai\tool\bitsandbytes\csrc\ops.cu
Traceback (most recent call last):
  File "C:\Users\xbox0\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\xbox0\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "E:\SD\Kohya\kohya_ss\venv\Scripts\accelerate.exe__main__.py", line 7, in <module>
  File "E:\SD\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "E:\SD\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "E:\SD\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['E:\SD\Kohya\kohya_ss\venv\Scripts\python.exe', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=E:/SD/stable-diffusion-webui/models/Stable-diffusion/realisticVisionV13_v13.safetensors', '--train_data_dir=E:/SD/lora data/image', '--resolution=512,512', '--output_dir=E:/SD/lora data/model', '--logging_dir=E:/SD/lora data/log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-5', '--unet_lr=0.0001', '--network_dim=8', '--output_name=last', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=150', '--train_batch_size=1', '--max_train_steps=1500', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1234', '--cache_latents', '--bucket_reso_steps=64', '--mem_eff_attn', '--gradient_checkpointing', '--xformers', '--use_8bit_adam', '--bucket_no_upscale']' returned non-zero exit status 1.
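
The line "no kernel image is available for execution on the device" is the CUDA runtime reporting that the loaded binary contains no kernel built for this GPU's architecture. A minimal sketch for checking which GPU and compute capability PyTorch sees is below. `describe_gpu` is a hypothetical helper name, and which architectures a given bitsandbytes build supports is not stated anywhere in this thread, so treat the output only as a starting point for comparison.

```python
def describe_gpu():
    """Summarize the GPU that PyTorch sees, or explain why none is visible."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "torch is not installed in this environment"
    if not torch.cuda.is_available():
        return "no CUDA device is visible to PyTorch"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    return (f"{name}: compute capability {major}.{minor}, "
            f"torch built for CUDA {torch.version.cuda}")


print(describe_gpu())
```

Run it inside the same venv that the training script uses, so that it reports the PyTorch build that actually loads the bitsandbytes DLL.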

Shamsiel4Life commented 1 year ago

I have the same issue. Have you found a solution yet?

xbox002000 commented 1 year ago

NO~

Humanoidme commented 1 year ago

Same error

Maranpani commented 1 year ago

Same Error ! Where is the smart genius to help us ? :)

CUDA SETUP: Loading binary C:\Users\Utilisateur\Documents\Kohya\kohya_ss\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit Adam optimizer
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 1700
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 850
  num epochs / epoch数: 1
  batch size per device / バッチサイズ: 2
  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 2
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 850
Traceback (most recent call last):
  File "C:\Users\Utilisateur\Documents\Kohya\kohya_ss\train_network.py", line 573, in <module>
    train(args)
  File "C:\Users\Utilisateur\Documents\Kohya\kohya_ss\train_network.py", line 356, in train
    "ss_noise_offset": args.noise_offset,
AttributeError: 'Namespace' object has no attribute 'noise_offset'
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Utilisateur\Documents\Kohya\kohya_ss\venv\Scripts\accelerate.exe__main__.py", line 7, in <module>
  File "C:\Users\Utilisateur\Documents\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\Utilisateur\Documents\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "C:\Users\Utilisateur\Documents\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\Utilisateur\Documents\Kohya\kohya_ss\venv\Scripts\python.exe', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=//UTILISATEUR-PC/Users/Utilisateur/stable-diffusion-webui/models/Stable-diffusion/realisticVisionV13_v13.safetensors', '--train_data_dir=C:\Users\Utilisateur\Documents\Lora TRaining DAta\test\image', '--resolution=512,512', '--output_dir=C:\Users\Utilisateur\Documents\Lora TRaining DAta\test\model', '--logging_dir=C:\Users\Utilisateur\Documents\Lora TRaining DAta\test\log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-5', '--unet_lr=0.0001', '--network_dim=8', '--output_name=last', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=85', '--train_batch_size=2', '--max_train_steps=850', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1234', '--cache_latents', '--bucket_reso_steps=64', '--mem_eff_attn', '--gradient_checkpointing', '--xformers', '--use_8bit_adam', '--bucket_no_upscale']' returned non-zero exit status 1.

Hifone2191 commented 1 year ago

I get the same error too

================================================================================
CUDA SETUP: Loading binary D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit Adam optimizer
override steps. steps for 20 epochs is / 指定エポックまでのステップ数: 2280
Traceback (most recent call last):
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\sd-scripts\train_network.py", line 548, in <module>
    train(args)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\sd-scripts\train_network.py", line 246, in train
    unet, text_encoder, network, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\accelerate\accelerator.py", line 876, in prepare
    result = tuple(
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\accelerate\accelerator.py", line 877, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\accelerate\accelerator.py", line 741, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\accelerate\accelerator.py", line 912, in prepare_model
    model = model.to(self.device)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\transformers\modeling_utils.py", line 1749, in to
    return super().to(*args, **kwargs)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\torch\nn\modules\module.py", line 927, in to
    return self._apply(convert)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\torch\nn\modules\module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\torch\nn\modules\module.py", line 602, in _apply
    param_applied = fn(param)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\torch\nn\modules\module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.40 GiB already allocated; 0 bytes free; 3.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "C:\Users\hifon\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\hifon\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\Scripts\accelerate.exe__main__.py", line 7, in <module>
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\AI 繪圖\lora-scripts-main\lora-scripts-main\venv\Scripts\python.exe', './sd-scripts/train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=./sd-models/model.ckpt', '--train_data_dir=./train/KRTEmilia', '--output_dir=./output', '--logging_dir=./logs', '--resolution=512,512', '--network_module=networks.lora', '--max_train_epochs=20', '--learning_rate=1e-4', '--unet_lr=1e-4', '--text_encoder_lr=1e-5', '--lr_scheduler=cosine_with_restarts', '--lr_warmup_steps=0', '--network_dim=32', '--network_alpha=32', '--output_name=KRTEmilia', '--train_batch_size=1', '--save_every_n_epochs=2', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1337', '--cache_latents', '--clip_skip=2', '--prior_loss_weight=1', '--max_token_length=225', '--caption_extension=.txt', '--save_model_as=safetensors', '--min_bucket_reso=256', '--max_bucket_reso=1024', '--xformers', '--shuffle_caption', '--use_8bit_adam']' returned non-zero exit status 1.
Train finished
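
Every log posted so far ends with the same "returned non-zero exit status 1" line, but that line only says the child process failed. The actual cause is printed earlier, and the three logs above show three different ones. A rough sketch of a triage helper, assuming plain substring matching is good enough; the `SIGNATURES` table and its labels are illustrative, not part of any tool mentioned here.

```python
# Substrings copied from the three logs posted above; the labels are illustrative.
SIGNATURES = [
    ("no kernel image is available",
     "CUDA binary has no kernel for this GPU"),
    ("has no attribute 'noise_offset'",
     "script reads an argument that the Namespace does not have"),
    ("CUDA out of memory",
     "GPU ran out of memory"),
]


def classify(log_text):
    """Return the label of every known signature found in log_text."""
    return [label for needle, label in SIGNATURES if needle in log_text]


sample = ("Error no kernel image is available for execution on the device "
          "at line 167 in file D:\\ai\\tool\\bitsandbytes\\csrc\\ops.cu")
print(classify(sample))
print(classify("returned non-zero exit status 1."))
```

The second call returns an empty list: the exit-status line on its own identifies none of the three causes, so a fix that helped one poster may not apply to another.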

mazswing commented 1 year ago

So, I guess I have the same issue as most of the users here:

Traceback (most recent call last):
  File "C:\Users\matth\Documents\kohya\kohya_ss\train_network.py", line 573, in <module>
    train(args)
  File "C:\Users\matth\Documents\kohya\kohya_ss\train_network.py", line 356, in train
    "ss_noise_offset": args.noise_offset,
AttributeError: 'Namespace' object has no attribute 'noise_offset'

Can someone help us here please? Thank you
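
For reference, this `AttributeError` is what Python raises whenever code reads an attribute that was never set on an `argparse.Namespace`. The sketch below reproduces it with a hypothetical parser that only defines `--seed`; it illustrates the mechanism only. Why `noise_offset` is missing in a given kohya_ss install is not established in this thread.

```python
import argparse

# Hypothetical parser that defines --seed but never defines --noise_offset.
parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=1234)
args = parser.parse_args(["--seed", "1234"])

# Direct attribute access fails the same way as line 356 in the traceback.
try:
    args.noise_offset
except AttributeError as err:
    message = str(err)

# getattr with a default is the defensive way to read an optional argument.
noise_offset = getattr(args, "noise_offset", None)

print(message)
print(noise_offset)
```

So the thing to look for is where the `args` object is built, and why it was built without a `noise_offset` entry.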

mazswing commented 1 year ago

> CUDA out of memory. Tried to allocate 20.00 MiB (G

I suggest you go into "training parameters", then "advanced configurations", and check the "Memory efficient attention" box. Does that help?
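
The numbers in that out-of-memory message are worth reading. The message itself says `max_split_size_mb` helps when reserved memory is much larger than allocated memory. A small sketch that pulls the figures out of the line and compares them (the `gib` helper is a hypothetical name):

```python
import re

# The out-of-memory line from the log above, copied verbatim.
OOM = ("RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB "
       "(GPU 0; 4.00 GiB total capacity; 3.40 GiB already allocated; "
       "0 bytes free; 3.46 GiB reserved in total by PyTorch)")


def gib(label, text):
    """Extract the number of GiB that precedes `label` in the message."""
    match = re.search(r"([\d.]+) GiB " + re.escape(label), text)
    return float(match.group(1))


total = gib("total capacity", OOM)
allocated = gib("already allocated", OOM)
reserved = gib("reserved in total", OOM)

# Reserved minus allocated is the memory PyTorch holds but is not using.
slack = reserved - allocated
print(f"total={total} GiB, allocated={allocated} GiB, reserved={reserved} GiB")
print(f"reserved but unallocated: {slack:.2f} GiB")
```

Here the gap is only 0.06 GiB, so fragmentation is probably not the problem. The 4 GiB card is simply full, which is why options that reduce memory use are the more promising direction.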

Hifone2191 commented 1 year ago

OK I will try it. Thank you

Hifone2191 commented 1 year ago

I have solved the problem thank you

Maranpani commented 1 year ago

> I have solved the problem thank you

How?

Hifone2191 commented 1 year ago

>> I have solved the problem thank you
>
> how ?

Find $train_unet_only = 0 in the file named "train.ps1" and change the 0 to 1.

chiakichi12 commented 1 year ago

It worked when I stopped using JSON data as a configuration file, but I don't know why.

mazswing commented 1 year ago

>>> I have solved the problem thank you
>>
>> how ?
>
> Find $train_unet_only = 0 in the file named "train.ps1" and change the 0 to 1.

Am I blind? Is the "train.ps1" file in the main folder \kohya\kohya_ss?

Where can I find that file, please? I still have errors :(

Hifone2191 commented 1 year ago

>>>> I have solved the problem thank you
>>>
>>> how ?
>>
>> Find $train_unet_only = 0 in the file named "train.ps1" and change the 0 to 1.
>
> Am I blind? Is the "train.ps1" file in the main folder \kohya\kohya_ss?
>
> Where can I find that file, please? I still have errors :(

I am using someone else's modpack, so it's possible that the original files don't include train.ps1. Sorry.

sirPhoebus commented 1 year ago

I got rid of "Use 8bit adam" and it worked !! It's in the advanced tab :))

abc1231998068 commented 1 year ago

It is one of the options in the last section we configure.

AshlynGuo commented 1 year ago

How do I find "8bit Adam", and then how do I delete it? Please help me ^-^

Zirnworks commented 1 year ago

I can't find how to disable 8bit Adam, either. I have scanned everything in the advanced config options, and I am still getting "returned non-zero exit status 1" when trying to train.

bichtoubard commented 1 year ago

Same trouble here. I tried a lot of options, but the training does not run. Does anyone have a new idea?

buketerbil commented 1 year ago

Hey guys, I am also experiencing the same problem on macOS. Has anyone got a solution yet?

ahnjin0210 commented 1 year ago

You can find 8-bit Adam under LoRA > Training > Parameters > Basic > Optimizer > AdamW8bit. I changed the option from AdamW8bit to AdamW, and then I could make my LoRA.
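
In the command lines printed in the logs earlier in this thread, the 8-bit optimizer appears as the `--use_8bit_adam` flag. A minimal sketch of the same change at the command-line level, assuming that switching the GUI option to AdamW amounts to launching without that flag (an assumption, not something confirmed here); `without_8bit_adam` is a hypothetical helper.

```python
def without_8bit_adam(argv):
    """Return a copy of argv with the --use_8bit_adam flag removed."""
    return [arg for arg in argv if arg != "--use_8bit_adam"]


# Tail of the command from the first log in this thread.
cmd = ["train_network.py", "--mem_eff_attn", "--gradient_checkpointing",
       "--xformers", "--use_8bit_adam", "--bucket_no_upscale"]

print(without_8bit_adam(cmd))
```

If the resulting command trains, that points at the 8-bit optimizer path rather than at the rest of the setup. Note that it is only the "no kernel image" log that names bitsandbytes as the source of the error.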

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.