chkimmmmm / R.A.C.E.

MIT License

Question Regarding Adversarial Attack Part #2

Closed futakw closed 2 months ago

futakw commented 2 months ago

Hi,

I have a question on the adversarial attack part.

https://github.com/chkimmmmm/R.A.C.E./blob/8913eda752f6087f6d240140a81eccbe2950c3fb/train-scripts/train-esd.py#L326

My understanding is that while the original SD model can predict noise $\epsilon$ on $z_t$ to generate a cleaned image $z_{t-1} = z_t - \epsilon$, the ESD model cannot predict $\epsilon$ for an unlearned concept. Then, the PGD attack aims to prompt the unlearned model to "re"-generate a cleaned image $z_{t-1}$ (with the unlearned concept), by enforcing the prediction to be close to $\epsilon$.

However, it seems that the PGD attack enforces the predicted noise to be closer to the start_code $= z_T$, instead of $\epsilon = z_t - z_{t-1}$ (which is unknown?).
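
For concreteness, here is a minimal, hypothetical sketch of the pattern I am describing (not the repository's actual code), assuming a PyTorch / diffusers-style UNet interface; `esd_unet`, `cond_emb`, `z_t`, `start_code`, and the hyper-parameters are all illustrative names:

```python
import torch
import torch.nn.functional as F

def pgd_on_embedding(esd_unet, cond_emb, z_t, t, start_code,
                     eps=0.05, alpha=0.01, steps=10):
    """Hypothetical PGD sketch: perturb the conditioning embedding so that the
    unlearned (ESD) UNet's predicted noise moves toward `start_code` (z_T)."""
    orig = cond_emb.detach()
    adv = orig.clone().requires_grad_(True)
    for _ in range(steps):
        pred_noise = esd_unet(z_t, t, encoder_hidden_states=adv).sample
        # The loss pulls the prediction toward start_code (z_T), not toward a
        # per-step difference z_t - z_{t-1} -- this is exactly what I am asking about.
        loss = F.mse_loss(pred_noise, start_code)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()              # signed gradient descent step
            adv = orig + (adv - orig).clamp(-eps, eps)   # project back into the L_inf ball
        adv.requires_grad_(True)
    return adv.detach()
```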

I'm curious why this approach is effective. Please correct me if I've misunderstood the algorithm.

futakw commented 2 months ago

Sorry, I misunderstood the diffusion process itself; this was not related to the adversarial attack implementation. The SD model predicts $\epsilon = z_T - z_0$.
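
For reference, in the standard $\epsilon$-prediction parameterization used by SD, the forward process samples

$$z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$

so the UNet is trained to predict the full injected noise $\epsilon$, not a single-step difference $z_t - z_{t-1}$.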

I will close this issue.

chkimmmmm commented 2 months ago

Hey Futa, sorry for the late reply. I was busy, but I am so happy that you found the solution!