Linaqruf / kohya-trainer

Adapted from https://note.com/kohya_ss/n/nbf7ce8d80f29 for easier cloning
Apache License 2.0
1.87k stars, 308 forks

No GPU/TPU Found, falling back to CPU #178

Open cosmiclantern opened 1 year ago

cosmiclantern commented 1 year ago

Kohya Dreambooth trainer. Suddenly started throwing this error today, was working perfectly a few hours prior. Refuses to see the GPU.

Running a T4 in Google Colab Pro.

It starts when running BLIP and the Waifu Diffusion Tagger, and then the trainer won't run from the "Start Training" cell:

    No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
    Loading settings from /content/dreambooth/config/config_file.toml...
    /content/dreambooth/config/config_file
    prepare tokenizer
    Downloading (…)olve/main/vocab.json: 961kB [00:00, 15.2MB/s]
    Downloading (…)olve/main/merges.txt: 525kB [00:00, 7.93MB/s]
    Downloading (…)cial_tokens_map.json: 100% 389/389 [00:00<00:00, 71.9kB/s]
    Downloading (…)okenizer_config.json: 100% 905/905 [00:00<00:00, 337kB/s]
    update token length: 225
    Load dataset config from /content/dreambooth/config/dataset_config.toml
    prepare images.
    found directory /content/dreambooth/train_data contains 11 image files
    found directory /content/dreambooth/reg_data contains 0 image files
    ignore subset with image_dir='/content/dreambooth/reg_data': no images found / 画像が見つからないためサブセットを無視します
    110 train images with repeating.
    0 reg images.
    no regularization images / 正則化画像が見つかりませんでした
    [Dataset 0]
      batch_size: 1
      resolution: (512, 512)
      enable_bucket: True
      min_bucket_reso: 256
      max_bucket_reso: 1024
      bucket_reso_steps: 64
      bucket_no_upscale: False

    [Subset 0 of Dataset 0]
      image_dir: "/content/dreambooth/train_data"
      image_count: 11
      num_repeats: 10
      shuffle_caption: True
      keep_tokens: 1
      caption_dropout_rate: 0
      caption_dropout_every_n_epoches: 0
      caption_tag_dropout_rate: 0
      color_aug: False
      flip_aug: False
      face_crop_aug_range: None
      random_crop: False
      is_reg: False
      class_tokens: vhtsrtl Girl
      caption_extension: .txt

    [Dataset 0]
    loading image sizes.
    100% 11/11 [00:00<00:00, 1518.77it/s]
    make buckets
    number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
    bucket 0: resolution (512, 512), count: 110
    mean ar error (without repeats): 0.0
    prepare accelerator
    Traceback (most recent call last)

    /content/kohya-trainer/train_db.py:409 in <module>

        406     args = parser.parse_args()
        407     args = train_util.read_config_from_file(args, parser)
        408
      ❱ 409     train(args)
        410

    /content/kohya-trainer/train_db.py:85 in train

         82             f"gradient_accumulation_stepsが{args.gradient_accumulation
         83         )
         84
      ❱  85     accelerator, unwrap_model = train_util.prepare_accelerator(args)
         86
         87     # mixed precisionに対応した型を用意しておき適宜castする
         88     weight_dtype, save_dtype = train_util.prepare_dtype(args)

    /content/kohya-trainer/library/train_util.py:2461 in prepare_accelerator

        2458         log_prefix = "" if args.log_prefix is None else args.log_pref
        2459         logging_dir = args.logging_dir + "/" + log_prefix + time.strf
        2460
      ❱ 2461     accelerator = Accelerator(
        2462         gradient_accumulation_steps=args.gradient_accumulation_steps,
        2463         mixed_precision=args.mixed_precision,
        2464         log_with=log_with,

    /usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:355 in __init__

        352         if self.state.mixed_precision == "fp16" and self.distributed_
        353             self.native_amp = True
        354             if not torch.cuda.is_available() and not parse_flag_from_
      ❱ 355                 raise ValueError(err.format(mode="fp16", requirement=
        356             kwargs = self.scaler_handler.to_kwargs() if self.scaler_h
        357             if self.distributed_type == DistributedType.FSDP:
        358                 from torch.distributed.fsdp.sharded_grad_scaler impor

    ValueError: fp16 mixed precision requires a GPU
    Traceback (most recent call last)

    /usr/local/bin/accelerate:8 in <module>

        5 from accelerate.commands.accelerate_cli import main
        6 if __name__ == '__main__':
        7     sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
      ❱ 8     sys.exit(main())
        9

    /usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py:45 in main

        42         exit(1)
        43
        44     # Run
      ❱ 45     args.func(args)
        46
        47
        48 if __name__ == "__main__":

    /usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py:1104 in launch_command

        1101     elif defaults is not None and defaults.compute_environment == Com
        1102         sagemaker_launcher(defaults, args)
        1103     else:
      ❱ 1104         simple_launcher(args)
        1105
        1106
        1107 def main():

    /usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py:567 in simple_launcher

        564     process = subprocess.Popen(cmd, env=current_env)
        565     process.wait()
        566     if process.returncode != 0:
      ❱ 567         raise subprocess.CalledProcessError(returncode=process.return
        568
        569
        570 def multi_gpu_launcher(args):

    CalledProcessError: Command '['/usr/bin/python3', 'train_db.py', '--sample_prompts=/content/dreambooth/config/sample_prompt.txt', '--dataset_config=/content/dreambooth/config/dataset_config.toml', '--config_file=/content/dreambooth/config/config_file.toml']' returned non-zero exit status 1.
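The `ValueError` in the first traceback comes from the check `torch.cuda.is_available()` failing while fp16 mixed precision is requested. A minimal sketch of a pre-flight check that could be run in a notebook cell before the training cell follows. The helper names `cuda_is_available` and `pick_mixed_precision` are hypothetical and are not part of kohya-trainer or accelerate; only the fp16 case from this issue is handled.

```python
def cuda_is_available() -> bool:
    """Return True only if PyTorch is importable and reports a CUDA device."""
    try:
        import torch  # imported lazily so the check itself never crashes
    except ImportError:
        return False
    return torch.cuda.is_available()


def pick_mixed_precision(has_cuda: bool, requested: str = "fp16") -> str:
    """Hypothetical helper: fall back to "no" when fp16 is requested without a GPU."""
    if requested == "fp16" and not has_cuda:
        return "no"
    return requested


if __name__ == "__main__":
    has_cuda = cuda_is_available()
    print("CUDA visible:", has_cuda)
    print("mixed_precision to use:", pick_mixed_precision(has_cuda))
```

If the check reports that CUDA is not visible on a GPU runtime, the environment setup (for example `LD_LIBRARY_PATH`) is the more likely culprit than the training config, and training on CPU is rarely a practical fallback.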

MushroomFleet commented 1 year ago

Yes, it seems Google Colab was updated recently and things have broken as a result. I get the same error no matter which GPU I choose now.

Options changed, premium/standard is gone.

cosmiclantern commented 1 year ago

Glad I'm not the only one!

huytd2k commented 1 year ago

This will be fixed by this PR: https://github.com/Linaqruf/kohya-trainer/pull/179/files. In the meantime, before running the first cell (installing dependencies), change the line

    os.environ["LD_LIBRARY_PATH"] = "/usr/local/cuda/lib64/:$LD_LIBRARY_PATH"

to

    os.environ["LD_LIBRARY_PATH"] = "/usr/local/cuda/lib64/:" + os.environ["LD_LIBRARY_PATH"]

in that cell. This fixes the issue.

Edited: fixed a typo (missing colon ":").
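Some background on why the one-line change matters: Python does not expand shell-style `$VAR` references inside string literals, so the original line set `LD_LIBRARY_PATH` to the literal text `/usr/local/cuda/lib64/:$LD_LIBRARY_PATH` and dropped any directories that were already listed there, which is presumably why the GPU libraries stopped being found. Below is a small sketch of a slightly more defensive variant that also works when the variable is unset. The helper name `prepend_to_path_var` is hypothetical and does not appear in the notebook.

```python
import os


def prepend_to_path_var(name: str, directory: str) -> str:
    """Prepend `directory` to the colon-separated environment variable `name`."""
    current = os.environ.get(name, "")
    # Skip the separator when the variable was unset or empty,
    # so the result does not end in a stray ":".
    os.environ[name] = f"{directory}:{current}" if current else directory
    return os.environ[name]


if __name__ == "__main__":
    print(prepend_to_path_var("LD_LIBRARY_PATH", "/usr/local/cuda/lib64/"))
```

Using `os.environ.get` avoids the `KeyError` that `os.environ["LD_LIBRARY_PATH"]` would raise on a runtime where the variable is not defined at all.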

cosmiclantern commented 1 year ago

Okay the trainer's working now, so thanks for that! But the Waifu Diffusion Tagger is broken suddenly:

    using existing wd14 tagger model
    found 11 images.
    loading model and labels
    Traceback (most recent call last)

    /content/kohya-trainer/finetune/tag_images_by_wd14_tagger.py:200 in <module>

        197   if args.caption_extention is not None:
        198     args.caption_extension = args.caption_extention
        199
      ❱ 200   main(args)
        201

    /content/kohya-trainer/finetune/tag_images_by_wd14_tagger.py:96 in main

         93   print(f"found {len(image_paths)} images.")
         94
         95   print("loading model and labels")
      ❱  96   model = load_model(args.model_dir)
         97
         98   # label_names = pd.read_csv("2022_0000_0899_6549/selected_tags.csv")
         99   # 依存ライブラリを増やしたくないので自力で読むよ

    /usr/local/lib/python3.9/dist-packages/keras/utils/traceback_utils.py:70 in error_handler

        67             filtered_tb = _process_traceback_frames(e.__traceback__)
        68             # To get the full stack trace, call:
        69             # tf.debugging.disable_traceback_filtering()
      ❱ 70             raise e.with_traceback(filtered_tb) from None
        71         finally:
        72             del filtered_tb
        73

    /usr/local/lib/python3.9/dist-packages/tensorflow/python/saved_model/loader_impl.py:115 in parse_saved_model

        112     except text_format.ParseError as e:
        113       raise IOError(f"Cannot parse file {path_to_pbtxt}: {str(e)}.")
        114   else:
      ❱ 115     raise IOError(
        116         f"SavedModel file does not exist at: {export_dir}{os.path.sep}
        117         f"{{{constants.SAVED_MODEL_FILENAME_PBTXT}|"
        118         f"{constants.SAVED_MODEL_FILENAME_PB}}}")

    OSError: SavedModel file does not exist at: wd14_tagger_model/{saved_model.pbtxt|saved_model.pb}
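The `OSError` above means that neither `saved_model.pb` nor `saved_model.pbtxt` was found in the `wd14_tagger_model` directory. The log line "using existing wd14 tagger model" suggests the directory itself was present, so it may be empty or incomplete. A minimal sketch of a check that could be run before the tagger cell follows; the function name `has_saved_model` is hypothetical.

```python
from pathlib import Path

# File names taken from the OSError message above.
SAVED_MODEL_FILES = ("saved_model.pb", "saved_model.pbtxt")


def has_saved_model(model_dir: str) -> bool:
    """Return True if `model_dir` contains one of the SavedModel files."""
    base = Path(model_dir)
    return any((base / name).is_file() for name in SAVED_MODEL_FILES)


if __name__ == "__main__":
    model_dir = "wd14_tagger_model"
    if has_saved_model(model_dir):
        print(f"{model_dir}: SavedModel file found")
    else:
        print(f"{model_dir}: no SavedModel file; the model may need to be downloaded again")
```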

Omenizer commented 1 year ago

    os.environ["LD_LIBRARY_PATH"] = "/usr/local/cuda/lib64/:" + os.environ["LD_LIBRARY_PATH"]

That worked, thank you so much 🙏🏼

Linaqruf commented 1 year ago

Hi, sorry for the problem, and thanks to @huytd2k for the PR. I had planned to remove the LD_LIBRARY_PATH environment change, because bitsandbytes can find the path itself, but then I found the right path for the fix. I also fixed the problem with xformers for fast-kohya-trainer.

https://github.com/Linaqruf/kohya-trainer/commit/ff701379c65380c967cd956e4e9e8f6349563878

If something like this happens again, I suggest commenting out the os.environ["LD_LIBRARY_PATH"] line in the 1.1. Install Dependencies section.

cosmiclantern commented 1 year ago

Thanks for the fix!!