THU-BPM / unforgeable_watermark

Source code for the paper "An Unforgeable Publicly Verifiable Watermark for Large Language Models", accepted at ICLR 2024
https://arxiv.org/abs/2307.16230

Questions about threat model #1

Closed Tongzhou0101 closed 2 months ago

Tongzhou0101 commented 6 months ago

Hi,

I found your work quite interesting! However, I have some questions regarding the threat model section.

Firstly, you introduce two attack scenarios (removal and forgery) with different assumptions about access to detection. You mention that attackers aiming to remove watermarks do not have access to detection, as it would aid in removal attacks. Does this imply that the watermark is not publicly verifiable in this case? Additionally, given the unpredictability of attackers' intentions (removal or forgery), how can access to detection be determined? What if an attacker has both objectives?

Secondly, for attackers focused on forgery who do have access to detection, is the detector model considered a black box or white box to them?

I would appreciate any clarification you can provide on these points. Thank you!

zhangbl6618 commented 3 months ago

For the second question, I think the detector model is a white box to attackers.

exlaw commented 3 months ago

Sorry I just saw this issue.

Thank you for your interest in our work. The overall assumption is that the attacker can obtain the complete detection model, i.e., a white-box setting. In the watermark removal setting, we did not add experiments in which the attacker leverages the detector to develop stronger removal attacks; that could be a future direction. Our removal-attack experiments were mainly intended to verify that our watermarking algorithm does not degrade significantly compared to KGW under some classical removal attacks (such as text rewriting).
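To make the black-box vs. white-box distinction concrete, here is a minimal toy sketch (not the paper's actual neural detector; all names and the trivial detection rule are illustrative assumptions). A black-box attacker can only query the detector and observe its verdicts, while a white-box attacker can inspect its internals and exploit them to construct a forgery directly:

```python
# Toy stand-in for a public watermark detector. The real detector in the
# paper is a neural network; a hard-coded rule is used here only to
# illustrate the access models.
def detect(text: str) -> bool:
    return "WATERMARK" in text

# Black-box attacker: limited to query access, sees only True/False verdicts.
def black_box_query(candidate: str) -> bool:
    return detect(candidate)

# White-box attacker: knows the detector's internals (here, the rule itself),
# so a passing input can be constructed directly rather than searched for.
def white_box_forgery(base_text: str) -> str:
    return base_text + " WATERMARK"

# An unwatermarked text fails detection; the white-box forgery passes.
plain = "some unwatermarked text"
forged = white_box_forgery(plain)
```

The white-box setting discussed above is the stronger assumption: any attack that works black-box also works white-box, so proving unforgeability against a white-box attacker covers both cases.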