TheLastBen / fast-stable-diffusion

fast-stable-diffusion + DreamBooth
MIT License
7.54k stars · 1.31k forks

Model trained in Runpod does not work in the ComfyUI cell. #2429

Open Taikakim opened 1 year ago

Taikakim commented 1 year ago

Just this:

It's an SDXL LoRA, rank 256, 50 epochs.

I copied the resulting safetensors file by hand to the ComfyUI directory; is this the right way?

```
Total VRAM 24260 MB, total RAM 257664 MB
xformers version: 0.0.20
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3090 : cudaMallocAsync
Using xformers cross attention
Starting server

✔ Connected https://*
got prompt
!!! Exception during processing !!!
Traceback (most recent call last):
  File "/workspace/ComfyUI/execution.py", line 151, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
  File "/workspace/ComfyUI/execution.py", line 81, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
  File "/workspace/ComfyUI/execution.py", line 74, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
  File "/workspace/ComfyUI/nodes.py", line 446, in load_checkpoint
    out = comfy.sd.load_checkpoint_guess_config(ckpt_path, output_vae=True, output_clip=True, embedding_directory=folder_paths.get_folder_paths("embeddings"))
  File "/workspace/ComfyUI/comfy/sd.py", line 1163, in load_checkpoint_guess_config
    model_config = model_detection.model_config_from_unet(sd, "model.diffusion_model.", fp16)
  File "/workspace/ComfyUI/comfy/model_detection.py", line 119, in model_config_from_unet
    unet_config = detect_unet_config(state_dict, unet_key_prefix, use_fp16)
  File "/workspace/ComfyUI/comfy/model_detection.py", line 36, in detect_unet_config
    model_channels = state_dict['{}input_blocks.0.0.weight'.format(key_prefix)].shape[0]
KeyError: 'model.diffusion_model.input_blocks.0.0.weight'

Prompt executed in 0.04 seconds
```

A second prompt produces an identical traceback, ending with `Prompt executed in 0.03 seconds`.
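The `KeyError` at the end is what a checkpoint loader produces when handed a LoRA file: it looks for the UNet weight `model.diffusion_model.input_blocks.0.0.weight`, which only exists in a full checkpoint, while a LoRA file contains only adapter tensors. A minimal sketch of that distinction, assuming kohya-style `lora_unet_...` key names (illustrative, not ComfyUI's actual detection code):

```python
# Sketch: decide whether a state dict looks like a full checkpoint or a
# LoRA, based on tensor key names. CHECKPOINT_KEY is taken from the
# traceback above; the LoRA prefixes are common kohya-style conventions
# and are an assumption, not an exhaustive list.

CHECKPOINT_KEY = "model.diffusion_model.input_blocks.0.0.weight"
LORA_PREFIXES = ("lora_unet_", "lora_te", "lora_down", "lora_up")

def classify_state_dict(keys):
    """Return 'checkpoint', 'lora', or 'unknown' for a set of tensor names."""
    if CHECKPOINT_KEY in keys:
        return "checkpoint"
    if any(k.startswith(LORA_PREFIXES) for k in keys):
        return "lora"
    return "unknown"

# A checkpoint loader raises the KeyError from the traceback when handed
# a LoRA-only state dict, since CHECKPOINT_KEY is simply absent:
lora_keys = {"lora_unet_input_blocks_1_0.lora_down.weight"}
print(classify_state_dict(lora_keys))          # lora
print(classify_state_dict({CHECKPOINT_KEY}))   # checkpoint
```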

Taikakim commented 1 year ago

Also, in A1111 I'm getting a longer error; I guess this is the relevant piece at the end:

```
modules.devices.NansException: A tensor with all NaNs was produced in Unet. This could be either because there's not enough precision to represent the picture, or because your video card does not support half type. Try setting the "Upcast cross attention layer to float32" option in Settings > Stable Diffusion or using the --no-half commandline argument to fix this. Use --disable-nan-check commandline argument to disable this check.
```

Taikakim commented 1 year ago

Hmm, here's something weird: I also did an SDXL LoRA with an A100 and the Kohya trainer, and I still get the same error... I was using FP16; should I use BF16? Then again, there is no setting for this in the Runpod notebook.
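For context on the FP16-vs-BF16 question: FP16's largest finite value is 65504, so intermediate activations that overflow become Inf and then NaN, which is one common cause of the `NansException` above, while BF16 keeps FP32's exponent range. A stdlib-only sketch of the difference (the bfloat16 round-trip here truncates rather than rounds, a simplification of real hardware):

```python
# Sketch of why FP16 training can overflow to Inf/NaN where BF16 does not.
# Pure stdlib; 'e' is the IEEE half-precision struct format code.
import struct

def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE half precision (raises on overflow)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Round-trip through bfloat16 by truncating an FP32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_fp16(65504.0))   # 65504.0, the largest finite FP16 value
print(to_bf16(1e6))       # 999424.0, close to 1e6: fine in BF16
try:
    to_fp16(1e6)          # overflows half precision
except OverflowError as e:
    print("FP16 overflow:", e)
```

BF16 trades mantissa precision for exponent range, which is why trainers often prefer it on hardware that supports it (Ampere and newer).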

TheLastBen commented 1 year ago

Once training is done, you don't need to copy it manually; simply run the testing cell and you'll find the LoRA ready in the LoRA list for A1111 or the LoRA node in ComfyUI. Try removing any model that you copied and restart.

Taikakim commented 1 year ago

No, there was only the base model when running the ComfyUI cell; that's why I had to copy it by hand.

Those errors are weird though, and they happen both in your notebook and when loading the model in the native ComfyUI notebook... All three models I created, on both the RTX 3090 and the A100, fail in the same way.

The only reason I can think of is that there's maybe an alpha channel present in the training or regularisation images; could that throw things off?

TheLastBen commented 1 year ago

The trained model isn't a full model, it's a LoRA, and it's automatically put inside the loras folder; you don't need to copy anything. Find some tutorials on how to use ComfyUI on YouTube, especially about the LoRA node.
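In a stock ComfyUI layout, full checkpoints live under `models/checkpoints` (loaded by a checkpoint loader node) and LoRAs under `models/loras` (loaded by the LoraLoader node on top of a base model). A small sketch of that routing, reusing the key heuristic from the traceback; the LoRA key check is an assumption based on common naming:

```python
# Sketch: pick the ComfyUI folder a trained file belongs in, from its
# tensor key names. Folder names match a stock ComfyUI install; the
# "lora" substring check is a loose heuristic, not ComfyUI's own logic.
import os

def comfyui_destination(keys, comfyui_root="/workspace/ComfyUI"):
    """Return models/checkpoints or models/loras based on tensor names."""
    if "model.diffusion_model.input_blocks.0.0.weight" in keys:
        sub = "checkpoints"   # full model: use a checkpoint loader node
    elif any("lora" in k for k in keys):
        sub = "loras"         # LoRA: use the LoraLoader node instead
    else:
        raise ValueError("unrecognised state dict")
    return os.path.join(comfyui_root, "models", sub)

print(comfyui_destination({"lora_unet_x.lora_down.weight"}))
# /workspace/ComfyUI/models/loras
```

Copying a LoRA into the checkpoints folder and selecting it in a checkpoint loader reproduces exactly the `KeyError` in the first comment.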

Taikakim commented 1 year ago

That's... True 🤦🤦🤦🤦

I'll just go be ashamed in the corner 😅 This is what you get for doing stuff past midnight...