facebookresearch / stable_signature

Official implementation of the paper "The Stable Signature: Rooting Watermarks in Latent Diffusion Models"

Whitening seems to destroy the decoder? #2

Closed zjysteven closed 1 year ago

zjysteven commented 1 year ago

Hi Pierre,

When I evaluated the whitened watermark decoder hidden_replicate_whit.torchscript.pth, the bit accuracy on clean (non-attacked) watermarked images was 49.42% (on 5000 COCO val images, each with a random key), i.e. essentially chance level. I'm not sure what's wrong here and would appreciate any suggestions or thoughts. Below is some additional information that might be helpful.

  1. I was using my own evaluation script. I could get a clean accuracy of 100% when using the non-whitened decoder in hidden_replicate.pth, so I think my evaluation script itself is fine (a rough sketch of the kind of check I mean is after this list).
  2. I trained several other models using the provided pre-train code hidden/main.py and observed similar results when doing the whitening.
  3. The same observations also apply to the checkpoints dec_48b.pth (99.99% bit accuracy) and dec_48b_whit.torchscript.pt (50.09% bit accuracy).
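
For reference, a minimal sketch of the kind of bit-accuracy check I mean (the image path, normalization, and 48-bit key length here are placeholders, not my exact script):

```python
import torch
from PIL import Image
from torchvision import transforms

# Load the torchscript extractor (whitened or not); the path is a placeholder.
decoder = torch.jit.load("dec_48b_whit.torchscript.pt").eval()

# The normalization is an assumption; use whatever the extractor was trained with.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# In practice the image below should have been watermarked with this exact key.
key = torch.randint(0, 2, (48,)).bool()
img = transform(Image.open("watermarked.png").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    logits = decoder(img)            # shape (1, 48)
decoded = logits.squeeze(0) > 0      # positive logit -> bit 1

bit_acc = (decoded == key).float().mean().item()
print(f"bit accuracy: {bit_acc:.4f}")
```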

Thank you

zjysteven commented 1 year ago

This colab notebook (adapted from the provided demo) also reproduces this problem.

pierrefdz commented 1 year ago

Hi Jingyang,

Indeed, the whitening changes the output of the extractor. Its goal is to make the output bits more independent and well distributed (see Appendices B.5 and B.6 of the paper); otherwise, some keys would have higher FPRs than others. In our case, since we discard the encoder at the end (we only use the extractor for the fine-tuning in Stable Signature), we can change the extractor and do the whitening. However, if you use the encoder to watermark your images, the message that you put into your images will be completely changed by the whitening layer at extraction time (tell me if I'm not clear enough).
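
For intuition, the whitening is just an affine layer fitted on extractor outputs; here is a rough sketch of how such a layer could be built (an illustration of PCA/ZCA whitening, not necessarily the exact code used in this repo):

```python
import torch

def build_whitening_layer(features: torch.Tensor) -> torch.nn.Linear:
    """Fit a linear whitening layer L(x) = Wx + b from raw extractor outputs.

    `features` is an (N, k) tensor of extractor outputs collected on many
    images; the returned layer decorrelates the k output dimensions so the
    decoded bits behave closer to i.i.d. on non-watermarked images.
    """
    mu = features.mean(dim=0)                  # (k,)
    cov = torch.cov(features.T)                # (k, k) covariance of outputs
    eigvals, eigvecs = torch.linalg.eigh(cov)  # symmetric eigendecomposition
    # ZCA whitening matrix: cov^{-1/2}
    W = eigvecs @ torch.diag(eigvals.clamp_min(1e-8).rsqrt()) @ eigvecs.T
    b = -W @ mu                                # so that L(x) = W(x - mu)

    k = features.shape[1]
    layer = torch.nn.Linear(k, k, bias=True)
    with torch.no_grad():
        layer.weight.copy_(W)
        layer.bias.copy_(b)
    return layer
```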

There are two options: (1) use the vanilla extractor without whitening; this is not much of an issue if you don't need perfect theoretical control over false positive rates. (2) You can try feeding "reversed" messages to the watermark encoder: for instance, if your message is $m$ and the whitening layer is $L(x) = Wx + b$, then you can feed $W^{-1}(m - b)$ as the message to the encoder, so that when it gets extracted the last layer gives $W(W^{-1}(m - b)) + b = m$ (but I have never tried this, so I can't say whether it will work).
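
Concretely, option (2) could look like this (untested, and it assumes you can access the whitening as the final torch.nn.Linear layer of the extractor and that messages are real-valued vectors, e.g. bits mapped to ±1):

```python
import torch

def reverse_message(m: torch.Tensor, whitening: torch.nn.Linear) -> torch.Tensor:
    """Pre-invert the whitening so that extraction recovers the original message.

    `whitening` implements L(x) = Wx + b (the last layer of the whitened
    extractor); `m` is the k-dim message, e.g. bits mapped to +/-1.
    """
    W, b = whitening.weight, whitening.bias
    return torch.linalg.solve(W, m - b)  # W^{-1} (m - b)

# Sanity check of the algebra: whitening(reverse_message(m, whitening)) == m
# up to numerical error, so the whitened extractor's last layer returns m.
```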

tl;dr: use the un-whitened extractor if you want to use the HiDDeN encoder.

zjysteven commented 1 year ago

Thank you Pierre! That makes sense, and now I see why this is actually expected.