Errors at the start of training

/workspace/HCP-Diffusion# accelerate launch -m hcpdiff.train_ac_single \ --cfg cfgs/train/examples/lora_anime_character.yaml \ character_name=noah \ dataset_dir=/workspace/HCP-Diffusion/data/noah wandb is not available /usr/lib/python3.10/runpy.py:126: RuntimeWarning: 'hcpdiff.train_ac_single' found in sys.modules after import of package 'hcpdiff', but prior to execution of 'hcpdiff.train_ac_single'; this may result in unpredictable behaviour warn(RuntimeWarning(msg)) 2023-10-13 08:03:42.162 | INFO | hcpdiff.loggers.cli_logger:_info:30 - world_size: 1 2023-10-13 08:03:42.162 | INFO | hcpdiff.loggers.cli_logger:_info:30 - accumulation: 1 You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors. 2023-10-13 08:03:45.678 | INFO | hcpdiff.models.text_emb_ex:hook:86 - hook: noah, len: 4, id: 28806 2023-10-13 08:03:45.830 | INFO | hcpdiff.data.caption_loader:load:18 - 2 record(s) loaded with JsonCaptionLoader, from path '/workspace/HCP-Diffusion/data/noah/image_captions.json' 2023-10-13 08:03:45.831 | INFO | hcpdiff.data.bucket:build_buckets_from_images:241 - build buckets from images size /usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning: The default value of n_init will change from 10 to 'auto' in 1.4. Set the value of n_init explicitly to suppress the warning super()._check_params_vs_input(X, default_n_init=10) 2023-10-13 08:03:45.851 | INFO | hcpdiff.data.bucket:build_buckets_from_images:262 - buckets info: size:[640 896], num:2 2023-10-13 08:03:45.851 | INFO | hcpdiff.loggers.cli_logger:_info:30 - len(train_dataset): 4 0%| | 0/4 [00:00<?, ?it/s]/workspace/HCP-Diffusion/hcpdiff/data/pair_dataset.py:107: FutureWarning: Accessing config attribute scaling_factor directly via 'AutoencoderKL' object attribute is deprecated. Please access 'scaling_factor' over 'AutoencoderKL's config object instead, e.g. 'unet.config.scaling_factor'. data['img'] = (latents*vae.scaling_factor).cpu() 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 11.16it/s] /usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py:128: FutureWarning: The configuration file of this scheduler: PNDMScheduler { "_class_name": "PNDMScheduler", "_diffusers_version": "0.21.4", "beta_end": 0.012, "beta_schedule": "scaled_linear", "beta_start": 0.00085, "num_train_timesteps": 1000, "prediction_type": "epsilon", "set_alpha_to_one": false, "skip_prk_steps": false, "steps_offset": 0, "timestep_spacing": "leading", "trained_betas": null } is outdated. steps_offset should be set to 1 instead of 0. Please make sure to update the config accordingly as leaving steps_offset might led to incorrect results in future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very nice if you could open a Pull request for the scheduler/scheduler_config.json file deprecate("steps_offset!=1", "1.0.0", deprecation_message, standard_warn=False) 2023-10-13 08:03:46.798 | INFO | hcpdiff.loggers.cli_logger:_info:30 - Running training 2023-10-13 08:03:46.798 | INFO | hcpdiff.loggers.cli_logger:_info:30 - Num batches each epoch = 1 2023-10-13 08:03:46.798 | INFO | hcpdiff.loggers.cli_logger:_info:30 - Num Steps = 1000 2023-10-13 08:03:46.798 | INFO | hcpdiff.loggers.cli_logger:_info:30 - Instantaneous batch size per device = 4 2023-10-13 08:03:46.799 | INFO | hcpdiff.loggers.cli_logger:_info:30 - Total train batch size (w. parallel, distributed & accumulation) = 4 2023-10-13 08:03:46.799 | INFO | hcpdiff.loggers.cli_logger:_info:30 - Gradient Accumulation steps = 1 Traceback (most recent call last): File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/workspace/HCP-Diffusion/hcpdiff/train_ac_single.py", line 61, in trainer.train() File "/workspace/HCP-Diffusion/hcpdiff/train_ac.py", line 383, in train loss = self.train_one_step(data_list) File "/workspace/HCP-Diffusion/hcpdiff/train_ac.py", line 480, in train_one_step self.optimizer_pt.step() File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 132, in step self.scaler.step(self.optimizer, closure) File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 372, in step assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer." AssertionError: No inf checks were recorded for this optimizer. Traceback (most recent call last): File "/usr/local/bin/accelerate", line 8, in sys.exit(main()) File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main args.func(args) File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 986, in launch_command simple_launcher(args) File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) subprocess.CalledProcessError: Command '['/usr/bin/python', '-m', 'hcpdiff.train_ac_single', '--cfg', 'cfgs/train/examples/lora_anime_character.yaml', 'character_name=noah', 'dataset_dir=/workspace/HCP-Diffusion/data/noah']' returned non-zero exit status 1.

I've been trying all day and haven't been able to solve it. OS: Ubuntu 22.04.2 LTS

IrisRainbowNeko / HCP-Diffusion

Errors at the start of training #40