Open cosmiclantern opened 1 year ago
Yes, it seems Google Colab pushed an update recently and things have broken as a result. I get the same error no matter which GPU I choose now.
Options changed, premium/standard is gone.
Glad I'm not the only one!
This will be fixed by this PR: https://github.com/Linaqruf/kohya-trainer/pull/179/files. In the meantime, before running the first cell (installing dependencies), change the line
os.environ["LD_LIBRARY_PATH"] = "/usr/local/cuda/lib64/:$LD_LIBRARY_PATH"
to
os.environ["LD_LIBRARY_PATH"] = "/usr/local/cuda/lib64/:" + os.environ["LD_LIBRARY_PATH"]
That fixes the issue.
Edited: typo missing colon ":"
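For anyone wondering why the one-line change matters: Python does not expand shell variables inside a string literal, so the original assignment stores the literal text "$LD_LIBRARY_PATH" and discards whatever paths were set before. Below is a minimal sketch of a slightly more defensive version of the same workaround. The helper name prepend_library_path is made up for illustration and is not part of kohya-trainer; it also handles the case where the variable is not set at all, which would raise a KeyError with plain os.environ[...].

```python
import os

CUDA_LIB_DIR = "/usr/local/cuda/lib64/"

# The original notebook line: the "$LD_LIBRARY_PATH" part is NOT expanded by
# Python, so the stored value contains that literal text.
BROKEN_VALUE = "/usr/local/cuda/lib64/:$LD_LIBRARY_PATH"


def prepend_library_path(new_dir, env=None):
    """Prepend new_dir to LD_LIBRARY_PATH, keeping any existing entries.

    Hypothetical helper for illustration; env defaults to os.environ.
    """
    if env is None:
        env = os.environ
    current = env.get("LD_LIBRARY_PATH", "")
    # Avoid a dangling ":" when the variable was empty or unset.
    env["LD_LIBRARY_PATH"] = f"{new_dir}:{current}" if current else new_dir
    return env["LD_LIBRARY_PATH"]


if __name__ == "__main__":
    # Work on a copy so the demo does not touch the real environment.
    demo_env = dict(os.environ)
    print(prepend_library_path(CUDA_LIB_DIR, demo_env))
```

In the notebook you would call prepend_library_path(CUDA_LIB_DIR) with no second argument so that the real environment is updated.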
Okay the trainer's working now, so thanks for that! But the Waifu Diffusion Tagger is broken suddenly:
using existing wd14 tagger model
found 11 images.
loading model and labels
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/kohya-trainer/finetune/tag_images_by_wd14_tagger.py:200 in tf.debugging.disable_traceback_filtering()
│
│ ❱ 70 │ │ │ raise e.with_traceback(filtered_tb) from None │
│ 71 │ │ finally: │
│ 72 │ │ │ del filtered_tb │
│ 73 │
│ │
│ /usr/local/lib/python3.9/dist-packages/tensorflow/python/saved_model/loader_ │
│ impl.py:115 in parse_saved_model │
│ │
│ 112 │ except text_format.ParseError as e: │
│ 113 │ raise IOError(f"Cannot parse file {path_to_pbtxt}: {str(e)}.") │
│ 114 else: │
│ ❱ 115 │ raise IOError( │
│ 116 │ │ f"SavedModel file does not exist at: {export_dir}{os.path.sep} │
│ 117 │ │ f"{{{constants.SAVED_MODEL_FILENAME_PBTXT}|" │
│ 118 │ │ f"{constants.SAVED_MODEL_FILENAME_PB}}}") │
╰──────────────────────────────────────────────────────────────────────────────╯
OSError: SavedModel file does not exist at:
wd14_tagger_model/{saved_model.pbtxt|saved_model.pb}
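The error message names the two files TensorFlow looks for in the model directory. Since the log also says "using existing wd14 tagger model", one possible explanation (an assumption, not confirmed in this thread) is that the directory exists but the download was incomplete. A quick sketch to check this before running the tagger; find_saved_model_file is a hypothetical helper, not something from the repo:

```python
from pathlib import Path

# File names taken from the error message above.
SAVED_MODEL_FILENAMES = ("saved_model.pb", "saved_model.pbtxt")


def find_saved_model_file(export_dir):
    """Return the SavedModel file inside export_dir, or None if it is missing."""
    base = Path(export_dir)
    for name in SAVED_MODEL_FILENAMES:
        candidate = base / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_saved_model_file("wd14_tagger_model")
    if found is None:
        print("No saved_model.pb or saved_model.pbtxt found; the model directory may be incomplete.")
    else:
        print(f"Found {found}")
```

If it reports nothing found, deleting the directory so the tagger downloads the model again might be worth trying.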
os.environ["LD_LIBRARY_PATH"] = "/usr/local/cuda/lib64/:" + os.environ["LD_LIBRARY_PATH"]
That worked, thank you so much 🙏🏼
Hi, sorry for the problem, and thanks to @huytd2k for the PR. I had planned to remove the LD_LIBRARY_PATH override entirely,
because bitsandbytes can find the path itself, but then I found the right path for the fix. I also fixed the xformers problem in fast-kohya-trainer.
https://github.com/Linaqruf/kohya-trainer/commit/ff701379c65380c967cd956e4e9e8f6349563878
If something like this happens again, I suggest commenting out the os.environ["LD_LIBRARY_PATH"] line in the 1.1. Install Dependencies
section.
Thanks for the fix!!
The Kohya Dreambooth trainer suddenly started throwing this error today; it was working perfectly a few hours prior. It refuses to see the GPU.
Running a T4 in Google Colab Pro.
It starts when running BLIP and Waifu Diffusion Tagger, and then the trainer won't run from the "Start Training" cell:
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Loading settings from /content/dreambooth/config/config_file.toml...
/content/dreambooth/config/config_file
prepare tokenizer
Downloading (…)olve/main/vocab.json: 961kB [00:00, 15.2MB/s]
Downloading (…)olve/main/merges.txt: 525kB [00:00, 7.93MB/s]
Downloading (…)cial_tokens_map.json: 100% 389/389 [00:00<00:00, 71.9kB/s]
Downloading (…)okenizer_config.json: 100% 905/905 [00:00<00:00, 337kB/s]
update token length: 225
Load dataset config from /content/dreambooth/config/dataset_config.toml
prepare images.
found directory /content/dreambooth/train_data contains 11 image files
found directory /content/dreambooth/reg_data contains 0 image files
ignore subset with image_dir='/content/dreambooth/reg_data': no images found / 画像が見つからないためサブセットを無視します
110 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 1
  resolution: (512, 512)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 64
  bucket_no_upscale: False
  [Subset 0 of Dataset 0]
    image_dir: "/content/dreambooth/train_data"
    image_count: 11
    num_repeats: 10
    shuffle_caption: True
    keep_tokens: 1
    caption_dropout_rate: 0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    is_reg: False
    class_tokens: vhtsrtl Girl
    caption_extension: .txt
[Dataset 0]
loading image sizes.
100% 11/11 [00:00<00:00, 1518.77it/s]
make buckets
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (512, 512), count: 110
mean ar error (without repeats): 0.0
prepare accelerator
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/kohya-trainer/train_db.py:409 in <module> │
│ │
│ 406 │ args = parser.parse_args() │
│ 407 │ args = train_util.read_config_from_file(args, parser) │
│ 408 │ │
│ ❱ 409 │ train(args) │
│ 410 │
│ │
│ /content/kohya-trainer/train_db.py:85 in train │
│ │
│ 82 │ │ │ f"gradient_accumulation_stepsが{args.gradient_accumulation │
│ 83 │ │ ) │
│ 84 │ │
│ ❱ 85 │ accelerator, unwrap_model = train_util.prepare_accelerator(args) │
│ 86 │ │
│ 87 │ # mixed precisionに対応した型を用意しておき適宜castする │
│ 88 │ weight_dtype, save_dtype = train_util.prepare_dtype(args) │
│ │
│ /content/kohya-trainer/library/train_util.py:2461 in prepare_accelerator │
│ │
│ 2458 │ │ log_prefix = "" if args.log_prefix is None else args.log_pref │
│ 2459 │ │ logging_dir = args.logging_dir + "/" + log_prefix + time.strf │
│ 2460 │ │
│ ❱ 2461 │ accelerator = Accelerator( │
│ 2462 │ │ gradient_accumulation_steps=args.gradient_accumulation_steps, │
│ 2463 │ │ mixed_precision=args.mixed_precision, │
│ 2464 │ │ log_with=log_with, │
│ │
│ /usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:355 in │
│ __init__ │
│ │
│ 352 │ │ if self.state.mixed_precision == "fp16" and self.distributed │
│ 353 │ │ │ self.native_amp = True │
│ 354 │ │ │ if not torch.cuda.is_available() and not parse_flag_from │
│ ❱ 355 │ │ │ │ raise ValueError(err.format(mode="fp16", requirement= │
│ 356 │ │ │ kwargs = self.scaler_handler.to_kwargs() if self.scaler_h │
│ 357 │ │ │ if self.distributed_type == DistributedType.FSDP: │
│ 358 │ │ │ │ from torch.distributed.fsdp.sharded_grad_scaler impor │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: fp16 mixed precision requires a GPU
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/local/bin/accelerate:8 in <module> │
│ │
│ 5 from accelerate.commands.accelerate_cli import main │
│ 6 if __name__ == '__main__': │
│ 7 │ sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0]) │
│ ❱ 8 │ sys.exit(main()) │
│ 9 │
│ │
│ /usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py │
│ :45 in main │
│ │
│ 42 │ │ exit(1) │
│ 43 │ │
│ 44 │ # Run │
│ ❱ 45 │ args.func(args) │
│ 46 │
│ 47 │
│ 48 if __name__ == "__main__": │
│ │
│ /usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py:1104 in │
│ launch_command │
│ │
│ 1101 │ elif defaults is not None and defaults.compute_environment == Com │
│ 1102 │ │ sagemaker_launcher(defaults, args) │
│ 1103 │ else: │
│ ❱ 1104 │ │ simple_launcher(args) │
│ 1105 │
│ 1106 │
│ 1107 def main(): │
│ │
│ /usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py:567 in │
│ simple_launcher │
│ │
│ 564 │ process = subprocess.Popen(cmd, env=current_env) │
│ 565 │ process.wait() │
│ 566 │ if process.returncode != 0: │
│ ❱ 567 │ │ raise subprocess.CalledProcessError(returncode=process.return │
│ 568 │
│ 569 │
│ 570 def multi_gpu_launcher(args): │
╰──────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['/usr/bin/python3', 'train_db.py',
'--sample_prompts=/content/dreambooth/config/sample_prompt.txt',
'--dataset_config=/content/dreambooth/config/dataset_config.toml',
'--config_file=/content/dreambooth/config/config_file.toml']' returned non-zero
exit status 1.
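The traceback above ends in accelerate refusing to start because torch.cuda.is_available() returned False while fp16 mixed precision was requested. A minimal preflight sketch that can be run in a cell before "Start Training" to surface that earlier. Both function names are hypothetical and not part of kohya-trainer or accelerate; the only real library call used is torch.cuda.is_available().

```python
def cuda_is_available():
    """Return True if PyTorch is importable and can see a CUDA device."""
    try:
        import torch  # imported lazily so the check also works without torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def check_mixed_precision(mixed_precision, has_cuda):
    """Return a problem description, or None if the setting looks usable.

    Hypothetical helper: mirrors the condition that raised
    "fp16 mixed precision requires a GPU" in the traceback above.
    """
    if mixed_precision == "fp16" and not has_cuda:
        return (
            "fp16 mixed precision requires a GPU, but none is visible. "
            "Check the Colab runtime type and the LD_LIBRARY_PATH line "
            "in the install cell before launching training."
        )
    return None


if __name__ == "__main__":
    problem = check_mixed_precision("fp16", cuda_is_available())
    print(problem or "GPU is visible; fp16 should be usable.")
```

If the check reports no GPU on a T4 runtime, the LD_LIBRARY_PATH change discussed earlier in this thread is the first thing to look at.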