Doesn't converge when I train with my own data

TencentARC / BrushNet

[ECCV 2024] The official implementation of paper "BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion"

https://tencentarc.github.io/BrushNet/

1.33k stars, 111 forks

Doesn't converge when I train with my own data #67

Open zf-666 opened 1 month ago

zf-666 commented 1 month ago

The loss keeps fluctuating. I wish I could see a plot of what a correct loss curve looks like.
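A noisy per-step loss is common in diffusion training, because every batch draws fresh random timesteps and noise, so the raw curve is hard to read on its own. One common way to see the trend is to smooth the logged values with an exponential moving average. The sketch below assumes you have exported the per-step loss values as a list of floats; `ema_smooth` and `beta` are illustrative names, not part of BrushNet.

```python
def ema_smooth(values, beta=0.98):
    """Exponential moving average of a loss series, with bias correction.

    Hypothetical helper: `values` is a list of per-step loss floats.
    Larger `beta` gives a smoother but more delayed curve.
    """
    smoothed = []
    avg = 0.0
    for step, v in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * v
        # Divide by (1 - beta**step) so early values are not biased toward 0.
        smoothed.append(avg / (1 - beta ** step))
    return smoothed
```

Plotting the smoothed series next to the raw one makes a slow downward drift, or a plateau, easier to spot than in the raw per-step values.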

CharlesGong12 commented 4 weeks ago

Me too, and it confuses me a lot. @zf-666, have you solved it? Or could the authors help us, please? @juxuan27

SmileTAT commented 3 weeks ago

How did you build your dataset?

MrWH123 commented 3 weeks ago

Hi @juxuan27 @yuanhangio, thanks for such great work! Could you please share some details on training Brushnet_sdxl, such as how many epochs, how long it takes, and how many GPUs were used? I used just one zip package from BrushData, only to understand the training details. But I find the loss still fluctuates even after 11,000+ steps (one zip has 10,000 images; with batch size 4 that is about 4 epochs). I guess the fluctuating loss is related to the random timestep sampled during training, but how can we tell when the model has converged if the loss gives so little guidance? A similar issue appears in (https://github.com/TencentARC/BrushNet/issues/35), but I didn't find any explanation of the loss there.
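On the timestep guess: the per-sample loss varies strongly with the sampled timestep (noise level), so the batch-mean loss can jump around even when training is stable. One way to get a clearer convergence signal is to average the loss separately per timestep range. This is a hypothetical helper, not something train_brushnet.py provides; it assumes you can obtain per-sample losses (for example by computing the MSE with reduction="none" and averaging over all but the batch dimension) together with the sampled timesteps, and that the scheduler uses 1000 training timesteps.

```python
from collections import defaultdict


class TimestepLossTracker:
    """Running mean of the training loss, grouped into timestep buckets.

    Hypothetical helper: feed it per-sample (timestep, loss) pairs.
    """

    def __init__(self, num_train_timesteps=1000, num_buckets=10):
        self.bucket_size = num_train_timesteps // num_buckets
        self.num_buckets = num_buckets
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, timesteps, losses):
        # timesteps and losses are parallel sequences, one entry per sample;
        # int()/float() also accept 0-d tensor elements.
        for t, loss in zip(timesteps, losses):
            b = min(int(t) // self.bucket_size, self.num_buckets - 1)
            self.sums[b] += float(loss)
            self.counts[b] += 1

    def means(self):
        # Mean loss per bucket; buckets that received no samples are omitted.
        return {b: self.sums[b] / self.counts[b] for b in sorted(self.counts)}

    def reset(self):
        self.sums.clear()
        self.counts.clear()
```

Logging `means()` every few hundred steps and then calling `reset()` gives one curve per bucket. If those curves have flattened out, more steps may not lower the loss much; sample quality on a fixed set of validation images remains the more direct check.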

zf-666 commented 3 weeks ago

I fixed the problem by using 'fp16' and 'fp16 vae', but another problem arose: my dataset is on the dark side, and the generated images, while fitting the distribution, always come out on the light side.
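To put a number on the "dark dataset, light outputs" observation, one option is to compare the mean brightness of the training images with that of the generated images. A minimal NumPy sketch, assuming both sets are already loaded as uint8 RGB arrays of shape (N, H, W, 3); the function names are hypothetical, and Rec. 601 luma weights serve as a simple brightness proxy.

```python
import numpy as np


def mean_luminance(images):
    """Average Rec. 601 luma of RGB images with values in [0, 255].

    images: array-like of shape (N, H, W, 3).
    """
    imgs = np.asarray(images, dtype=np.float64)
    luma = 0.299 * imgs[..., 0] + 0.587 * imgs[..., 1] + 0.114 * imgs[..., 2]
    return float(luma.mean())


def brightness_shift(train_images, generated_images):
    # Positive result: generated images are brighter than the training data.
    return mean_luminance(generated_images) - mean_luminance(train_images)
```

Tracking this shift across checkpoints could show whether the bias toward lighter outputs shrinks with more training.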

xduzhangjiayu commented 3 weeks ago

Hi @MrWH123, which image resolution did you use for training? Only 1024x1024, or random resolutions? I'd appreciate a reply!

MrWH123 commented 2 weeks ago

@xduzhangjiayu 1024x1024 for SDXL.
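For non-square source images, one common way to reach a 1024x1024 training size is to resize the shorter side to 1024 and then center-crop. The thread does not say whether train_brushnet.py does exactly this, so the pure-Python sketch below (hypothetical helper `resize_and_center_crop_box`) only computes the geometry; the same resize and crop must also be applied to the mask so it stays aligned with the image.

```python
def resize_and_center_crop_box(width, height, target=1024):
    """Geometry for resize-shorter-side-then-center-crop to target x target.

    Returns (new_w, new_h, crop_box), where crop_box is
    (left, top, right, bottom) in the resized image's coordinates.
    """
    scale = target / min(width, height)
    # max() guards against rounding the shorter side just below target.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return new_w, new_h, (left, top, left + target, top + target)
```

With Pillow, the result could be applied as `img.resize((new_w, new_h)).crop(box)`.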

MrWH123 commented 2 weeks ago

@zf-666 Could you share your training hyperparameters and loss curve?

shaoyandea commented 1 week ago

@yuanhangio @juxuan27 My own dataset has only about 5-10 images. Can BrushNet converge on so few? What is the minimum number of images I should prepare?

xduzhangjiayu commented 4 days ago

I use BrushData as my dataset, but some samples in the .tar files are missing the "width" field, so training fails. Does anyone know how to modify train_brushnet.py to skip such samples and continue training? Thanks so much!
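One way to skip broken samples is to check each sample's metadata before it reaches the code that reads "width", and drop the sample if the field is absent. The sketch below assumes each sample is a dict whose decoded JSON metadata lives under a "json" key, with the shard key under "__key__" (the layout webdataset-style readers typically produce); adapt `meta_key` and `required` to the actual layout used by train_brushnet.py. If the script builds its pipeline with webdataset, `has_required_fields` could serve as the predicate of a `select` stage; otherwise, wrap the sample iterator with `skip_incomplete`. Both helpers are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def has_required_fields(sample, required=("width",), meta_key="json"):
    """Return True if sample[meta_key] is a dict containing every required field.

    Assumption: metadata is a decoded-JSON dict stored under `meta_key`.
    """
    meta = sample.get(meta_key)
    return isinstance(meta, dict) and all(k in meta for k in required)


def skip_incomplete(samples, **kwargs):
    """Wrap any iterable of samples, dropping (and logging) incomplete ones."""
    for sample in samples:
        if has_required_fields(sample, **kwargs):
            yield sample
        else:
            logger.warning(
                "Skipping sample %s: missing metadata",
                sample.get("__key__", "<unknown>"),
            )
```

Logging the skipped keys, rather than silently dropping them, makes it easy to see how much of the dataset is affected.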