Stability-AI / generative-models

Generative Models by Stability AI
MIT License

Training code for the Adversarial Diffusion Distillation (ADD) not available? #238

Open Mohan2351999 opened 9 months ago

Mohan2351999 commented 9 months ago

I was not able to find the code for the ADD training mechanism. When will the code be released?

tnickMoxuan commented 9 months ago

Looking forward to the release of the training code.

m-muaz commented 9 months ago

Same question. Is the training code planned to be released soon?

jon-chuang commented 8 months ago

Actually, if you look at the ADD paper, they train StyleGAN-T++ for 2M iterations at batch size 2048 on 128 A100s. This suggests that the project had a budget that allows for ~100K USD experiments. So I highly doubt the ordinary person is going to be able to replicate their result, even with the training code available.

It is probably more appropriate to think of the ADD model as training an SD model almost from scratch. The problem it learns is much harder than LCM - they have to go from noise straight to a highly polished image.

LCM never manages to do that as the original training process of SD is not designed to do few-step denoising, so my hypothesis is that ADD has to learn a lot of new "concepts".
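For context, the student objective the ADD paper describes combines an adversarial loss with a weighted distillation loss against the teacher's prediction. A minimal sketch of the two terms in PyTorch; `add_losses` is a hypothetical helper, not Stability AI's code, and the hinge-style generator loss and the weighting scheme are my assumptions about the paper's setup:

```python
import torch
import torch.nn.functional as F

def add_losses(student_x0, teacher_x0, disc_logits_fake, lambda_distill=2.5):
    """Toy sketch of the two ADD student losses (hypothetical helper).

    student_x0       : student's one-step prediction of the clean image
    teacher_x0       : teacher's denoised reconstruction of a re-noised
                       version of student_x0 (distillation target)
    disc_logits_fake : discriminator logits on student_x0
    lambda_distill   : weight on the distillation term
    """
    # Hinge-style adversarial loss for the generator/student:
    # push the discriminator's logits on student samples up.
    loss_adv = -disc_logits_fake.mean()
    # Distillation loss: pull the student toward the teacher's prediction.
    # The teacher target is detached - gradients flow only into the student.
    loss_distill = F.mse_loss(student_x0, teacher_x0.detach())
    return loss_adv + lambda_distill * loss_distill
```

The discriminator gets its own hinge loss on real vs. student samples in a separate optimizer step, as in standard GAN training.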

Mohan2351999 commented 8 months ago

@jon-chuang, thanks for your feedback. I tried to implement a training mechanism similar to what ADD is doing, but it seems to have a lot of instability in training, which doesn't yield good images. Looking at the paper, though, I think they mention training for 4k iterations at batch size 128, right (details in the ADD paper, page 6, Table 1)?

fingerk28 commented 8 months ago

@Mohan2351999, have you achieved good results with your ADD training? I've also tried training an ADD model, but the images generated after a little training looked terrible, like those from a failed GAN run.

Mohan2351999 commented 8 months ago

@fingerk28 I was getting similar images, which become complete noise with longer training, probably due to instability. I still face the issue of 'nan' in the grad_norm of the discriminator while training. Please let me know if you find any success with your training. Thanks.

jon-chuang commented 8 months ago

> I think they mention that they train for 4k iterations at batch size 128, right (details in the ADD paper, page 6, Table 1)?

Ok, you're right, colour me surprised. I expected Stability AI (and all major for-profit labs) to withhold details like that.

> but it seems to have a lot of instability in training,

I have the same result (and others I've talked to have reported the same).

But GAN training is generally very hard to tune.

> I still face the issue of 'nan' in the grad_norm of the discriminator while training.

I think in the ADD paper they mention using R1 gradient penalty as regularization. I have yet to try this.
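For reference, R1 regularization (Mescheder et al., 2018) penalizes the squared gradient norm of the discriminator on real data only, which tends to stabilize GAN training. A toy sketch of how it is usually computed; `r1_penalty` is my own helper, not from the ADD code:

```python
import torch

def r1_penalty(disc, real_images, gamma=1.0):
    """R1 gradient penalty (sketch): (gamma / 2) * E[ ||grad_x D(x)||^2 ]
    over real images only. `disc` is any callable returning per-sample logits."""
    real_images = real_images.detach().requires_grad_(True)
    logits = disc(real_images)
    # Gradient of the summed logits w.r.t. the input images.
    # create_graph=True so the penalty itself is differentiable for training.
    (grad,) = torch.autograd.grad(
        outputs=logits.sum(), inputs=real_images, create_graph=True
    )
    # Mean squared gradient norm over the batch, scaled by gamma / 2.
    return (gamma / 2.0) * grad.flatten(1).pow(2).sum(dim=1).mean()
```

This loss is added to the discriminator loss (often only every k-th step, "lazy regularization") before calling `backward()`.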

jon-chuang commented 8 months ago

Btw @Mohan2351999, do shoot me an email at chuang dot jon at gmail dot com if you want to chat about this more offline. I'm quite determined to have this ADD training succeed.

Mohan2351999 commented 8 months ago

Hi @jon-chuang, thanks for your answers. I have already tried including the R1 gradient penalty, but still couldn't get rid of the "nan" in the gradient norm for the discriminator.

Thanks for sharing your contact, I will send you an email soon.
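One standard mitigation for the NaN grad_norm, not from the paper, just common practice: clip the discriminator gradients and skip the optimizer step whenever the norm goes non-finite. `clip_and_check_grads` is a hypothetical helper:

```python
import torch

def clip_and_check_grads(model, max_norm=1.0):
    """Clip gradients in place and report whether they are finite (sketch).
    Call after loss.backward(); only step the optimizer if this returns True."""
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # clip_grad_norm_ returns the pre-clipping total norm; a NaN/Inf here
    # means at least one parameter gradient blew up this step.
    return bool(torch.isfinite(total_norm))
```

Skipping the bad step (rather than letting NaNs propagate into the weights) at least keeps the run alive while you hunt for the underlying instability.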

YangPanHZAU commented 8 months ago

@Mohan2351999 @jon-chuang I have also tried to reproduce ADD recently, and I have some doubts about the training data. Is it the LAION dataset? Will the quality of the training data have a significant impact on adversarial training?

MqLeet commented 7 months ago

@jon-chuang @Mohan2351999 Hi, have you obtained good generation results? I used the training method of ADD, but the generated images have color issues, such as oversaturation...

Just like this: [attached screenshot of an oversaturated sample]

And I don't know what the problem is...

leffff commented 7 months ago

Hey there! While the code for ADD is still unpublished, I started working on my own implementation. In a couple of weeks I will be able to train and test my model. For my tests I have trained my own (toy) UNet on the Food-101 dataset, which I will then distill.

Will be glad to receive any comments and pieces of advice on my work!

https://github.com/leffff/adversarial-diffusion-distillation/

digbangbang commented 6 months ago

Hi, the paper says that the number of sampling steps for the teacher model is set to 1, which I think is unreasonable. I tried using a DDPM trained on CIFAR10 for ADD experiments: when the teacher samples with a single step, the result is an image of completely random noise. Or is their teacher model already strong enough to generate high-quality images in one step?

jonaskohler commented 6 months ago

You're right, the single-step teacher is quite useless. You can see this in Table 1(d) by comparing the first and second rows.

leffff commented 6 months ago
[Screenshot 2024-02-22 at 12:21:15] [Screenshot 2024-02-22 at 12:21:02]

Here are screenshots from the paper showing they do only 1 teacher step, which is in my opinion unreasonable: we force the student to produce samples of the best possible quality in 4 steps instead of all the teacher's steps, which means the teacher should be making more steps.

But imagine the teacher makes fewer steps than the student. That means the generation quality of the teacher is worse than the student's, so why would we want the student's predictions to be as close as possible to the teacher's?

I do not understand this part yet.

In this video https://www.youtube.com/watch?v=ZxPQtXu1Wbw the author says that the teacher makes 1000 steps.

digbangbang commented 6 months ago

@leffff I also have the same question. The experiments in the paper also show that the discriminator plays the main role. Do you currently have any results on code reproduction? I used part of your code to try to reproduce the unconditional setting on CIFAR10. The training time may be longer than I thought. If there is any progress, we can communicate at any time. Thanks, bro!

leffff commented 6 months ago

> @leffff I also have the same question. The experiments in the paper also show that the discriminator plays the main role. Do you currently have any results on code reproduction? I used part of your code to try to reproduce the unconditional setting on CIFAR10. The training time may be longer than I thought. If there is any progress, we can communicate at any time. Thanks, bro!

I will soon change my UNet and dataset and switch to either ImageNet or CIFAR10! If I succeed I will inform you! Waiting for your results :)

leffff commented 6 months ago

Okay I've figured out the answer.

The main contribution to distillation is made by the discriminator, while the teacher is there to prevent overfitting, and this is the reason the teacher only does 1 step.
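To make the one-step teacher concrete: the teacher re-noises the student's sample to some timestep t and then recovers x0 in a single step from its epsilon prediction, x0_hat = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t). A toy sketch of that target computation, assuming an eps-predicting teacher; `teacher_one_step_x0` is my own hypothetical helper, not from any released code:

```python
import torch

def teacher_one_step_x0(teacher_eps, student_x0, alpha_bar_t, t, noise=None):
    """One-step teacher target for ADD-style distillation (sketch).

    teacher_eps : callable (x_t, t) -> predicted noise (hypothetical signature)
    student_x0  : student's one-step sample, re-noised to timestep t below
    alpha_bar_t : cumulative alpha-bar for timestep t (0-dim tensor)
    """
    if noise is None:
        noise = torch.randn_like(student_x0)
    # Forward-diffuse the student sample to timestep t.
    x_t = alpha_bar_t.sqrt() * student_x0 + (1 - alpha_bar_t).sqrt() * noise
    # Single-step x0 recovery from the teacher's epsilon prediction.
    eps = teacher_eps(x_t, t)
    return (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
```

Note the target's quality depends on the teacher only through one denoising call at timestep t, which fits the reading that the teacher acts as a regularizer rather than the main training signal.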

jonaskohler commented 6 months ago

@leffff Thanks for the explanation! Did you uncover any training hacks that were not mentioned in the paper? And are you getting good results for a single step?