cvlab-columbia / zero123

Zero-1-to-3: Zero-shot One Image to 3D Object (ICCV 2023)
https://zero123.cs.columbia.edu/
MIT License

Unable to train successfully #20

Closed · greeneggsandyaml closed this 1 year ago

greeneggsandyaml commented 1 year ago

Hello and thank you for your very nice paper!

I am trying to train a view-conditional network using the code in zero123, but something is going wrong. I am wondering if my command is wrong, or if there is something else that I am missing.

I am using the command:

python main.py --base configs/sd-objaverse-finetune-c_concat-256.yaml --train --gpus=0,1,2,3 precision=16

I have trained for 10,000 steps and it is evident from the generations that something is going wrong. Do you know why this might be / should I be using a different command?

For context, the logged images look as follows:

[logged images: inputs_gs-000000_e-000000_b-000000, conditioning_gs-000000_e-000000_b-000000, reconstruction_gs-000000_e-000000_b-000000, samples_gs-000000_e-000000_b-000000, samples_cfg_scale_3.00_gs-000000_e-000000_b-000000]

Thank you so much for your help!

ruoshiliu commented 1 year ago

Hi @greeneggsandyaml, we initialized our model weights with the image-conditioned Stable Diffusion released by Lambda Labs. Can you share the loss curve of your training as well? It looks to me like the loss has diverged due to instability. My guess is that you are using a smaller batch size (4 vs. 8 GPUs) and a randomly initialized Stable Diffusion, which causes the training instability. I couldn't find the version of the model weights that I used for initialization online (it's the one from version 2). We are also working on releasing the dataset, so we will release the training script after testing those. In the meantime, feel free to experiment with training a randomly initialized SD with different batch sizes and learning rates.

greeneggsandyaml commented 1 year ago

Hello and thanks for the quick response!

Yes, the loss is diverging (as is evident from the images). I do not think it is a batch size issue, as I also tried with gradient accumulation and with 8 GPUs. It must be the initialization.

For initialization, I'm slightly confused -- are you saying that you used lambdalabs/stable-diffusion-image-conditioned or a different set of weights based on SDv2? Did you convert these yourself to the format required by the ldm code?

Also, what do you mean when you say that you will release the training script after the dataset -- I thought the training script was already released (in zero123/)?

Apologies for the confusion and thanks so much for the help!

ruoshiliu commented 1 year ago

I’m sorry for the confusion. We did upload the training script, but the repo does not yet provide guidance or a command for training because the training data and initialization checkpoint are not released. We are working on that and will release everything once the training data is ready.

‘Version 2’ refers to version 2 of the image-conditioned Stable Diffusion trained by a company called Lambda Labs. I tried to find a checkpoint you can download online, but I couldn’t. We will release our copy of the weights along with the dataset.

greeneggsandyaml commented 1 year ago

Thanks! I appreciate the quick response.

For training, that makes sense -- no rush. Did you get the image-conditioned SDv2 directly from Lambda Labs then, rather than finding it online?

ojmichel commented 1 year ago

Hi, thank you very much for sharing this amazing work. I was able to find these checkpoints from Lambda Labs: https://huggingface.co/lambdalabs/stable-diffusion-image-conditioned/tree/main. Is either of these correct? The code was able to load sd-clip-vit-l14-img-embed_full.ckpt when I tried.

Also, would it be possible to release some training documentation? Even if the dataset and checkpoints cannot be released yet, it would be helpful to have some guidance on running training.
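In case it helps, a quick way to fetch that file programmatically might look like this (an untested sketch using the huggingface_hub client; the repo id and filename are just the ones from the link above):

```python
from huggingface_hub import hf_hub_download

# Untested sketch: download the image-conditioned SD checkpoint from the
# Lambda Labs Hugging Face repo linked above. Swap the filename for the other
# .ckpt in that repo if that turns out to be the right one.
ckpt_path = hf_hub_download(
    repo_id="lambdalabs/stable-diffusion-image-conditioned",
    filename="sd-clip-vit-l14-img-embed_full.ckpt",
)
print(ckpt_path)  # local cache path; point the training command at this file
```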

ruoshiliu commented 1 year ago

Hi all, sorry for the delay. I've updated the readme file for the training script. Could you please try the commands here and let me know if they work?

greeneggsandyaml commented 1 year ago

Thanks! I will try it out and let you know how it works.

CiaoHe commented 1 year ago

@ruoshiliu I tried the updated training script, but found that the CLIPImageEncoder doesn't match the config file sd-objaverse-finetune-c_concat-256.yaml:

sd-objaverse-finetune-c_concat-256.yaml writes

cond_stage_model.model.visual.xxxxx

but lambdalabs ckpt gives

cond_stage_model.transformer.vision_model.xxxxx

Perhaps the lambdalabs ckpt listed in the README doesn't match your original training network architecture config.


P.S. For anyone interested, I found the Lambda Labs git repo: https://github.com/justinpinkney/stable-diffusion. The right img-cond-sd ckpt should be here, rather than the one described in the README below @greeneggsandyaml:

Download image-conditioned stable diffusion checkpoint released by Lambda Labs
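To check which layout a given checkpoint actually has before training, something like this should work (a sketch assuming the usual ldm checkpoint format with a top-level state_dict):

```python
import torch

# Sketch: list which image-encoder key prefixes a checkpoint contains, to see
# whether it matches what sd-objaverse-finetune-c_concat-256.yaml expects
# (cond_stage_model.model.visual.*) or the other layout mentioned above
# (cond_stage_model.transformer.vision_model.*).
ckpt = torch.load("sd-clip-vit-l14-img-embed_full.ckpt", map_location="cpu")
keys = ckpt["state_dict"].keys()  # assumes the usual ldm checkpoint layout

print("cond_stage_model.model.visual.*            :",
      any(k.startswith("cond_stage_model.model.visual.") for k in keys))
print("cond_stage_model.transformer.vision_model.*:",
      any(k.startswith("cond_stage_model.transformer.vision_model.") for k in keys))
```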

ruoshiliu commented 1 year ago

@CiaoHe Did you manage to train it successfully?

CiaoHe commented 1 year ago

Yeah, I can train now.


yuanzhi-zhu commented 1 year ago

Hi @CiaoHe,

Do you still get

missing keys: ['cc_projection.weight', 'cc_projection.bias']

after using the new img-cond-sd checkpoint (along with a bunch of unexpected keys)?

I am also curious how you made it work with precision=16, as I always got errors like:

RuntimeError: expected scalar type Half but found Float.

CiaoHe commented 1 year ago

> Do you still get missing keys: ['cc_projection.weight', 'cc_projection.bias'] after using the new img-cond-sd checkpoint (along with a bunch of unexpected keys)?

I think the missing keys ['cc_projection.weight', 'cc_projection.bias'] are normal, since the original SD ckpt doesn't have the cc_projection layer, which is newly initialized and trained in zero123.
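For reference, a quick way to see this is to build the model from the training config and load the checkpoint non-strictly; roughly (an untested sketch, run from the zero123/ directory, with the checkpoint path replaced by whichever img-cond SD weights you downloaded):

```python
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

# Untested sketch: build the model from the training config, then load the
# image-conditioned SD weights non-strictly. The only missing parameters should
# be cc_projection.weight / cc_projection.bias, which zero123 adds on top of the
# base model and trains from scratch, so those "missing keys" are expected.
config = OmegaConf.load("configs/sd-objaverse-finetune-c_concat-256.yaml")
model = instantiate_from_config(config.model)

ckpt = torch.load("sd-image-conditioned-v2.ckpt", map_location="cpu")  # placeholder path
state_dict = ckpt.get("state_dict", ckpt)

missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)                 # expect only cc_projection.*
print("number of unexpected keys:", len(unexpected))
```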

ruoshiliu commented 1 year ago

Hi all, I tested both the pretrained checkpoint provided in our repo and the one on the Lambda Labs Hugging Face page linked by @CiaoHe here. Both should work. The former is the second version of the latter, trained for more iterations, but that shouldn't affect zero123 training too much.

Initial evaluation (without any training) should look like this in TensorBoard: [loss curves: orange is initialized from version 1, blue from version 2]

yhyang-myron commented 1 year ago

> I tried the updated training script, but found that the CLIPImageEncoder doesn't match the config file sd-objaverse-finetune-c_concat-256.yaml: the config expects cond_stage_model.model.visual.xxxxx, but the lambdalabs ckpt gives cond_stage_model.transformer.vision_model.xxxxx [...]

@ruoshiliu Hi, thank you for your great work! I still ran into this problem when using the checkpoint provided in the repo.

jh27kim commented 1 year ago

> I tried the updated training script, but found that the CLIPImageEncoder doesn't match the config file sd-objaverse-finetune-c_concat-256.yaml: the config expects cond_stage_model.model.visual.xxxxx, but the lambdalabs ckpt gives cond_stage_model.transformer.vision_model.xxxxx [...]
>
> I still ran into this problem when using the checkpoint provided in the repo.

@ruoshiliu Hi, I also encountered the same problem. Could you please verify which checkpoint we should finetune from?

Thanks