devilismyfriend / StableTuner

Finetuning SD in style.
GNU Affero General Public License v3.0
666 stars 51 forks

ZeroDivisionError: division by zero #95

Open NoteToSelfFindGoodNickname opened 1 year ago

NoteToSelfFindGoodNickname commented 1 year ago

Environment name is set as "ST" as per environment.yaml
anaconda3/miniconda3 detected in C:\Users\tomwe\miniconda3
Starting conda environment "ST" from C:\Users\tomwe\miniconda3
warning: redirecting to https://github.com/devilismyfriend/StableTuner.git/
Latest git hash: ef51982

(ST) C:\Users\tomwe\st4>accelerate "launch" "--mixed_precision=fp16" "scripts/trainer.py" "--attention=xformers" "--model_variant=base" "--normalize_masked_area_loss" "--unmasked_probability=0.0" "--max_denoising_strength=1.0" "--disable_cudnn_benchmark" "--use_text_files_as_captions" "--sample_step_interval=50" "--pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1-base" "--pretrained_vae_name_or_path=" "--output_dir=models/iconex" "--seed=3434554" "--resolution=512" "--train_batch_size=24" "--num_train_epochs=100" "--mixed_precision=fp16" "--use_bucketing" "--aspect_mode=dynamic" "--aspect_mode_action_preference=add" "--use_8bit_adam" "--gradient_checkpointing" "--gradient_accumulation_steps=1" "--learning_rate=3e-6" "--lr_warmup_steps=0" "--lr_scheduler=constant" "--train_text_encoder" "--concepts_list=stabletune_concept_list.json" "--num_class_images=200" "--save_every_n_epoch=5" "--n_save_sample=2" "--sample_height=512" "--sample_width=512" "--dataset_repeats=1" "--add_sample_prompt=an apple by iconex" "--sample_on_training_start"
The following values were not passed to accelerate launch and had defaults used instead:
        --num_processes was set to a value of 1
        --num_machines was set to a value of 1
        --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Booting Up StableTuner
Please wait a moment as we load up some stuff...
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

CUDA SETUP: Loading binary C:\Users\tomwe\miniconda3\envs\ST\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
C:\Users\tomwe\miniconda3\envs\ST\lib\site-packages\diffusers\configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to from_config. If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Creating Auto Bucketing Dataloader
Rounded resolution to: 512
Preloading images...
Processing C:/Users/tomwe/Desktop/auswahl: 100%|█████████████████████████████| 165/165 [00:00<00:00, 10560.81it/s]
Number of buckets: 1
** Bucket (512, 512) found 35 images, will drop 11 images due to batch size 24
Number of image-caption pairs: 24

** Validation Set: val, steps: 1, repeats: 1

Loading Latent Cache from models\iconex\logs\latent_cache
Latents are ready.
Traceback (most recent call last):
  File "C:\Users\tomwe\st4\scripts\trainer.py", line 2902, in <module>
    main()
  File "C:\Users\tomwe\st4\scripts\trainer.py", line 2216, in main
    args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)
ZeroDivisionError: division by zero
Traceback (most recent call last):
  File "C:\Users\tomwe\miniconda3\envs\ST\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\tomwe\miniconda3\envs\ST\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\tomwe\miniconda3\envs\ST\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\tomwe\miniconda3\envs\ST\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\tomwe\miniconda3\envs\ST\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "C:\Users\tomwe\miniconda3\envs\ST\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\tomwe\miniconda3\envs\ST\python.exe', 'scripts/trainer.py', '--attention=xformers', '--model_variant=base', '--normalize_masked_area_loss', '--unmasked_probability=0.0', '--max_denoising_strength=1.0', '--disable_cudnn_benchmark', '--use_text_files_as_captions', '--sample_step_interval=50', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1-base', '--pretrained_vae_name_or_path=', '--output_dir=models/iconex', '--seed=3434554', '--resolution=512', '--train_batch_size=24', '--num_train_epochs=100', '--mixed_precision=fp16', '--use_bucketing', '--aspect_mode=dynamic', '--aspect_mode_action_preference=add', '--use_8bit_adam', '--gradient_checkpointing', '--gradient_accumulation_steps=1', '--learning_rate=3e-6', '--lr_warmup_steps=0', '--lr_scheduler=constant', '--train_text_encoder', '--concepts_list=stabletune_concept_list.json', '--num_class_images=200', '--save_every_n_epoch=5', '--n_save_sample=2', '--sample_height=512', '--sample_width=512', '--dataset_repeats=1', '--add_sample_prompt=an apple by iconex', '--sample_on_training_start']' returned non-zero exit status 1
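For context: the crash happens because num_update_steps_per_epoch is 0 when trainer.py recomputes the epoch count, i.e. the training dataloader ends up with no batches left after images are dropped. The sketch below is only an assumption about how those values are typically derived in a diffusers-style trainer (it is not copied from trainer.py), but it reproduces the same ZeroDivisionError when the dataloader comes back empty:

```python
import math

# Hedged sketch (assumed structure, not StableTuner's actual code): how the
# values seen in the traceback are usually derived in diffusers-style trainers.
num_train_epochs = 100            # --num_train_epochs=100
gradient_accumulation_steps = 1   # --gradient_accumulation_steps=1
train_batches = 0                 # what the dataloader yields if every usable
                                  # image has been dropped away

num_update_steps_per_epoch = math.ceil(train_batches / gradient_accumulation_steps)
max_train_steps = num_train_epochs * num_update_steps_per_epoch  # 0

# Mirrors trainer.py line 2216 from the traceback: with zero update steps per
# epoch, this ceil() division raises ZeroDivisionError.
num_train_epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)
```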

NoteToSelfFindGoodNickname commented 1 year ago

There seems to be an error in the recalculation that happens when images are dropped due to batch size. For example, I had 29 images but a batch size of 24, so 5 images were dropped. Once I changed the batch size to 29, the error was gone.
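If that diagnosis is right, a defensive check around the recalculation would at least replace the crash with an actionable message. This is only a sketch of such a guard (a hypothetical helper, not a patch against the real trainer.py):

```python
import math

def compute_num_train_epochs(max_train_steps: int, num_update_steps_per_epoch: int) -> int:
    """Recompute the epoch count, but fail with a clear message instead of a
    ZeroDivisionError when bucketing/dropping leaves no trainable batches."""
    if num_update_steps_per_epoch == 0:
        raise ValueError(
            "No training batches left after bucketing dropped images; "
            "lower --train_batch_size or add more images to the dataset."
        )
    return math.ceil(max_train_steps / num_update_steps_per_epoch)
```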