andreas128 / RePaint

Official PyTorch Code and Models of "RePaint: Inpainting using Denoising Diffusion Probabilistic Models", CVPR 2022

How to train or finetune on custom dataset? #20

Open universewill opened 2 years ago

universewill commented 2 years ago

How to train or finetune on custom dataset?

kafei123456 commented 2 years ago

Have you got the training code?

andreas128 commented 2 years ago

You need to train using the original guided-diffusion code base.

To get the same commit as we used, do the following:

git clone https://github.com/openai/guided-diffusion.git
cd guided-diffusion
git checkout 912d577

For training, we use 256 × 256 crops with a batch size of three on each of four V100 GPUs.
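
For reference, a minimal training invocation with that code base might look like the sketch below. The flag names come from guided-diffusion's README and scripts/image_train.py; the concrete values are assumptions, not the confirmed RePaint training configuration.

# Sketch only: flag values are assumptions, not the confirmed RePaint setup.
MODEL_FLAGS="--image_size 256 --num_channels 256 --num_res_blocks 2 --learn_sigma True --attention_resolutions 32,16,8"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 3"
python scripts/image_train.py --data_dir path/to/your/images $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

For multiple GPUs, guided-diffusion launches the same script through MPI (e.g. mpiexec -n 4 python scripts/image_train.py ...), which is how it distributes training.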

Does that work for you?

xyz-xdx commented 1 year ago

@universewill Were you able to train the model on your own dataset?

jbnjvc10000 commented 1 year ago

Could you provide the training parameters? For example, the training parameters for imagenet256?

MaugrimEP commented 11 months ago

I tried to reproduce the training details on CelebaHQ without success. I updated the guided-diffusion parameters using the face_example.yml file. Are those the parameters used for training? The loss stays at NaN for a very large number of steps. But with the default configuration it is NaN only for a few steps and then improves. What kind of behavior is expected?

jbnjvc10000 commented 11 months ago

I tried to reproduce the training details on CelebaHQ without success. I updated the guided-diffusion parameters using the face_example.yml file. Are those the parameters used for training? The loss stays at NaN for a very large number of steps. But with the default configuration it is NaN only for a few steps and then improves. What kind of behavior is expected?

  1. The parameters in face_example.yml are the training parameters.
  2. Modifying some of those parameters will also modify the network itself (e.g. the U-Net structure); see the sketch below.
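
Regarding point 2, here is a sketch of the distinction, assuming the usual guided-diffusion flag names (the exact contents of face_example.yml are not reproduced here; values are illustrative assumptions):

# Architecture flags: changing any of these changes the U-Net that is built,
# so an existing checkpoint will no longer load afterwards.
MODEL_FLAGS="--image_size 256 --num_channels 256 --num_res_blocks 2 --attention_resolutions 32,16,8 --learn_sigma True"
# Training-only flags: these can be changed without touching the network structure.
TRAIN_FLAGS="--lr 1e-4 --batch_size 3 --save_interval 10000"
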
MaugrimEP commented 11 months ago

@jbnjvc10000 did you try it? When I set num_channels to 256, I often get NaN in the loss, which skips the training step, and it does not go away.

jbnjvc10000 commented 11 months ago

@jbnjvc10000 did you try it? When I set num_channels to 256, I often get NaN in the loss, which skips the training step, and it does not go away.

@MaugrimEP Sorry, I didn't run into your problem. Maybe you should make sure your dataset and parameters are the same as those provided in the paper?

MaugrimEP commented 11 months ago

@jbnjvc10000 I solved my issue. It was system-related: I ran the same code under Linux and Windows; Windows produced NaN and Linux was perfect. :> classic

jbnjvc10000 commented 11 months ago

@MaugrimEP

By the way, it may be that a file could not be found because Windows uses different path conventions.

LinWeiJeff commented 11 months ago

@jbnjvc10000 did you try it? When I set num_channels to 256, I often get NaN in the loss, which skips the training step, and it does not go away.

@MaugrimEP Hello, what type of GPU did you use for training?

I used a single 12 GB GeForce RTX 4070 GPU to train a model with guided-diffusion on my own dataset of 120,000 high-quality human facial images (each 256×256 pixels, RGB). However, if I set num_channels to 256, then no matter what batch_size I set, the training command fails with: RuntimeError: CUDA out of memory. I think this means my GPU memory isn't enough, so I am curious what type of GPU you used for training. I would be grateful if you could reply!
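
If memory is the bottleneck, guided-diffusion exposes a few options that reduce GPU memory use. The sketch below is an assumption about what might help on a 12 GB card, not a guarantee that num_channels 256 will fit, and note that fp16 training can itself produce NaN losses:

# Memory-saving options (sketch): gradient checkpointing, fp16, and microbatching.
MODEL_FLAGS="--image_size 256 --num_channels 256 --num_res_blocks 2 --learn_sigma True --use_checkpoint True --use_fp16 True"
TRAIN_FLAGS="--lr 1e-4 --batch_size 4 --microbatch 1"
python scripts/image_train.py --data_dir path/to/faces $MODEL_FLAGS $TRAIN_FLAGS

Otherwise, reducing num_channels is the straightforward fallback, at the cost of model capacity.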