GaParmar / img2img-turbo

One-step image-to-image with Stable Diffusion turbo: sketch2image, day2night, and more
MIT License

How to resume training? #43

Open heyuhhh opened 3 months ago

heyuhhh commented 3 months ago

Hi, I have some checkpoints saved from training. How can I resume training from them?

aihacker111 commented 3 months ago

You need to replace the author's pretrained model with your own trained checkpoint, comment out the code that downloads the pretrained model, and run training again; it will then load your checkpoint as the pretrained model and continue training on whatever dataset you set. Note: make sure the LoRA ranks of the UNet and VAE in your checkpoint and in your new training run have the same values.

tlp-labmetro commented 1 month ago

Hi, I'm using pix2pix_turbo and want to resume training. The first training run took three days, and I want to resume it with fewer, more challenging images.

I have tried several ways to resume it, including giving the path to the checkpoint .pkl file and substituting the pre-trained edge_to_image_loras.pkl file, but with no success.

I also tried what @aihacker111 wrote, but with no success.

Can someone give me a hint to resume training from a checkpoint with pix2pix_turbo?

Kind regards

aihacker111 commented 1 month ago

@tlp-labmetro You need to make sure the LoRA ranks match the task: for edges, r_unet = 8 and r_vae = 4; for sketch, r_unet = 128 and r_vae = 4. Make sure every UNet and VAE rank is correct.
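
A quick way to confirm which ranks a checkpoint was trained with is to load the .pkl and read the rank keys it stores (a minimal sketch; the path is a placeholder, and it assumes the checkpoint is the dict saved by the training script with the "rank_unet" / "rank_vae" keys that the loading code later in this thread reads):

    import torch

    # placeholder path; point this at your own saved checkpoint
    sd = torch.load("output/pix2pix_turbo/checkpoints/model_XXXX.pkl", map_location="cpu")

    # the same keys are read when the checkpoint is loaded to resume training
    print("rank_unet:", sd["rank_unet"])
    print("rank_vae:", sd["rank_vae"])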

aihacker111 commented 1 month ago

@tlp-labmetro This is my personal fine-tuning of it on a large dataset with mixed weights for edges and sketch: https://huggingface.co/spaces/myn0908/S2I-Artwork-Sketch-to-Image-Diffusion. You can look there for more details.

tlp-labmetro commented 1 month ago

@aihacker111 Thank you very much for your response. I looked at the link you sent me, but I confess I could not find a solution (I'm more on the application side than a software developer).

Anyway, by trial and error I got it running with some modifications to the files, as below:

In the training call I use the path to my checkpoint:

    accelerate launch src/train_pix2pix_turbo.py \
        --pretrained_model_name_or_path="path_to/checkpoints/model_39501.pkl" \
        --output_dir="output/pix2pix_turbo/folder1" \
        --dataset_folder="data/LTSdata/folder1" \
        --resolution=512 \
        --train_batch_size=1 \
        --enable_xformers_memory_efficient_attention \
        --viz_freq 25 \
        --track_val_fid \
        --report_to "wandb" \
        --tracker_project_name "pix2pix_turbo_go"

Add these lines in the train_pix2pix_turbo.py file:

    if args.pretrained_model_name_or_path != "stabilityai/sd-turbo":
        print("checkpoint to train_pix2pix_turbo")
        net_pix2pix = Pix2Pix_Turbo(pretrained_path=args.pretrained_model_name_or_path,
                                    lora_rank_unet=args.lora_rank_unet,
                                    lora_rank_vae=args.lora_rank_vae)
        net_pix2pix.set_train()
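
For orientation, the conditional above is meant to sit alongside the stock model construction, roughly like this (just a sketch; it assumes the unmodified script builds Pix2Pix_Turbo without a pretrained_path argument, which would also explain the "random weights" message mentioned further down):

    if args.pretrained_model_name_or_path != "stabilityai/sd-turbo":
        # the resume branch added above
        ...
    else:
        # original behaviour: start from sd-turbo with fresh LoRA adapters
        net_pix2pix = Pix2Pix_Turbo(lora_rank_unet=args.lora_rank_unet,
                                    lora_rank_vae=args.lora_rank_vae)
        net_pix2pix.set_train()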

Add some lines in the pix2pix_turbo.py file:

    elif pretrained_path is not None:
        print("Initializing model with OWN weights")
        sd = torch.load(pretrained_path, map_location="cpu")
        unet_lora_config = LoraConfig(r=sd["rank_unet"], init_lora_weights="gaussian", target_modules=sd["unet_lora_target_modules"])
        vae_lora_config = LoraConfig(r=sd["rank_vae"], init_lora_weights="gaussian", target_modules=sd["vae_lora_target_modules"])
        vae.add_adapter(vae_lora_config, adapter_name="vae_skip")
        _sd_vae = vae.state_dict()
        for k in sd["state_dict_vae"]:
            _sd_vae[k] = sd["state_dict_vae"][k]
        vae.load_state_dict(_sd_vae)
        unet.add_adapter(unet_lora_config)
        _sd_unet = unet.state_dict()
        for k in sd["state_dict_unet"]:
            _sd_unet[k] = sd["state_dict_unet"][k]
        unet.load_state_dict(_sd_unet)

        # Added tlp
        target_modules_vae = ["conv1", "conv2", "conv_in", "conv_shortcut", "conv", "conv_out",
            "skip_conv_1", "skip_conv_2", "skip_conv_3", "skip_conv_4",
            "to_k", "to_q", "to_v", "to_out.0",
        ]
        target_modules_unet = [
            "to_k", "to_q", "to_v", "to_out.0", "conv", "conv1", "conv2", "conv_shortcut", "conv_out",
            "proj_in", "proj_out", "ff.net.2", "ff.net.0.proj"
        ]
        unet_lora_config = LoraConfig(r=lora_rank_unet, init_lora_weights="gaussian",
            target_modules=target_modules_unet
        )
        self.lora_rank_unet = lora_rank_unet
        self.lora_rank_vae = lora_rank_vae
        self.target_modules_vae = target_modules_vae
        self.target_modules_unet = target_modules_unet
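
One possible tightening of the block above (my own suggestion, not something from the repo): take the ranks and target-module lists from the checkpoint dict itself, since it already stores them under the keys read above; then they cannot disagree with the rank values passed on the command line:

        # read the ranks and target modules from the checkpoint so they always
        # match what the LoRA adapters above were actually built with
        self.lora_rank_unet = sd["rank_unet"]
        self.lora_rank_vae = sd["rank_vae"]
        self.target_modules_unet = sd["unet_lora_target_modules"]
        self.target_modules_vae = sd["vae_lora_target_modules"]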

At least it is running now, but I am not sure whether the results will be correct.

Kind regards

aihacker111 commented 1 month ago

@tlp-labmetro Show me the error you get when you start training.

tlp-labmetro commented 1 month ago

@aihacker111 Without the modifications, when I use the path to my local checkpoint instead of --pretrained_model_name_or_path="stabilityai/sd-turbo", it returns, among other things: "Initializing model with random weights."

aihacker111 commented 1 month ago

@tlp-labmetro You need to wire it into the pix2pix_turbo code and make the modification there; the args you pass are only used for checks in the training code and are not tied to the model checkpoint.