Akegarasu / lora-scripts

LoRA & Dreambooth training scripts & GUI using kohya-ss's trainer, for diffusion models.
GNU Affero General Public License v3.0

Flux training error #528

Closed ZeroYuJie closed 1 week ago

ZeroYuJie commented 1 week ago
INFO Building CLIP flux_utils.py:74
INFO Loading state dict from /sd_model/clip/sd3/clip_l.safetensors flux_utils.py:167
2024-09-25 19:53:06 INFO Loaded CLIP: flux_utils.py:170
INFO Loading state dict from /sd_model/clip/sd3/t5xxl_fp8_e4m3fn.safetensors flux_utils.py:215
2024-09-25 19:53:09 INFO Loaded T5xxl: flux_utils.py:218
INFO Loaded fp8 T5XXL model flux_train_network.py:101
INFO Building AutoEncoder flux_utils.py:62
INFO Loading state dict from /sd_model/vae/ae.safetensors flux_utils.py:66
2024-09-25 19:53:10 INFO Loaded AE: flux_utils.py:69
import network module: networks.lora_flux
2024-09-25 19:53:11 INFO [Dataset 0] train_util.py:2329
INFO caching latents with caching strategy. train_util.py:989
INFO checking cache validity... train_util.py:999
100%|████████████████████| 396/396 [00:03<00:00, 124.20it/s]
2024-09-25 19:53:14 INFO no latents to cache train_util.py:1039
2024-09-25 19:53:15 INFO move vae and unet to cpu to save memory flux_train_network.py:208
Traceback (most recent call last):
  File "/app/lora-scripts/./scripts/dev/flux_train_network.py", line 519, in <module>
    trainer.train(args)
  File "/app/lora-scripts/scripts/dev/train_network.py", line 402, in train
    self.cache_text_encoder_outputs_if_needed(args, accelerator, unet, vae, text_encoders, train_dataset_group, weight_dtype)
  File "/app/lora-scripts/./scripts/dev/flux_train_network.py", line 212, in cache_text_encoder_outputs_if_needed
    unet.to("cpu")
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1174, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1167, in convert
    raise NotImplementedError(
NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
19:53:16-580989 ERROR Training failed / 训练失败

I am using the Docker image nvcr.io/nvidia/pytorch:24.07-py3 as the base image and ran the install script as documented, but training fails to start.
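
For anyone who lands here with the same message: the traceback shows that at least one parameter of `unet` was still on PyTorch's `meta` device (shape and dtype only, no data) when `unet.to("cpu")` ran. The snippet below is only a minimal sketch of that failure mode on a toy `nn.Linear`, not the trainer's code and not a fix for this issue. The helper name `demo_meta_move` and the `reference` layer standing in for a real checkpoint are made up for illustration, and the demo assumes PyTorch is installed (it skips itself otherwise).

```python
import importlib.util

# PyTorch is assumed to be installed; without it the demo is skipped.
HAVE_TORCH = importlib.util.find_spec("torch") is not None


def demo_meta_move():
    """Reproduce the error on a toy layer, then move it with to_empty()."""
    import torch
    from torch import nn

    # Parameters on the "meta" device have a shape and dtype but no storage.
    layer = nn.Linear(4, 2, device="meta")

    # Module.to() must copy data, and a meta tensor has none to copy.
    try:
        layer.to("cpu")
        to_failed = False
    except NotImplementedError:
        to_failed = True

    # to_empty() allocates uninitialized storage on the target device.
    layer = layer.to_empty(device="cpu")

    # The values are arbitrary until real weights are loaded.
    reference = nn.Linear(4, 2)  # hypothetical stand-in for a real checkpoint
    layer.load_state_dict(reference.state_dict())

    weights_match = torch.equal(layer.weight, reference.weight)
    return to_failed, layer.weight.is_meta, weights_match


if HAVE_TORCH:
    print(demo_meta_move())
```

Note that `to_empty()` only allocates memory. If a model is still on `meta` at this point in training, the more likely underlying problem is that its weights were never actually loaded, so checking the model path and the load step is probably more useful than swapping `to()` for `to_empty()`.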

ZeroYuJie commented 1 week ago

https://github.com/Akegarasu/lora-scripts/issues/477