Open liyucheng09 opened 1 month ago
A defense strategy that inspects an LLM’s response for harmfulness could be broken using adversarial self-replicating sequences [1]. These sequences can manipulate the LLM into copying the adversarial sequence into its response, potentially compromising subsequent models that process this output. Our proposed erase-and-check framework could strengthen such defenses by applying to the response mechanisms similar to those that check the input for harmfulness.
[1] Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications. Stav Cohen, Ron Bitton, Ben Nassi. https://arxiv.org/abs/2403.02817
If it's due to the limitation of the harmfulness detector, then I believe the exact same limitation also applies to your erase-and-check approach? Your approach also relies on this harmfulness detector. If this detector is not reliable, then how can I trust the results from erase-and-check?
The issue is not just the limitation of the harmfulness detector. The defense you propose lacks certified robustness guarantees, rendering it susceptible to novel attacks even within the threat model. In contrast, our erase-and-check framework is specifically designed to provide certified robustness guarantees, even with the inherent limitations of the harmfulness detector (safety filter). It takes a detector, which may be vulnerable to adversarial attacks, and makes it more robust. That’s the point of erase-and-check.
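For concreteness, here is a minimal sketch of the suffix-mode erase-and-check idea, assuming a hypothetical `is_harmful(text) -> bool` safety filter; the function name, whitespace tokenization, and `max_erase` parameter are illustrative placeholders, not the repository's actual API.

```python
def erase_and_check(prompt: str, is_harmful, max_erase: int = 20) -> bool:
    """Return True if the prompt should be rejected as harmful.

    The prompt is rejected if the safety filter flags the full prompt or any
    version with up to `max_erase` trailing tokens erased. This is what yields
    the certificate: an adversarial suffix of length at most `max_erase` is
    removed entirely in at least one of the checked subsequences, so the
    filter sees the underlying harmful prompt without the attack suffix.
    """
    tokens = prompt.split()  # placeholder tokenizer; the paper erases model tokens
    for num_erased in range(min(max_erase, len(tokens)) + 1):
        candidate = " ".join(tokens[: len(tokens) - num_erased])
        if is_harmful(candidate):
            return True  # some erased version is flagged -> reject the prompt
    return False  # no erased version is flagged -> accept the prompt
```

The wrapper does not make the filter itself better; it guarantees that a bounded adversarial suffix cannot hide a prompt the filter would otherwise catch.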
My question is simple. Compared to directly checking the response, what makes erase-and-check more robust? Are you simply suggesting that it is less likely the detector fails to detect harmfulness across all responses from all erased variants?
What prevents you from directly checking the response to the input in the first place: if the response is safe, return it to the user; if not, simply refuse to answer? Why bother with erase-and-check at all? A sketch of what I mean is below.
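To make the question concrete, this is the baseline I have in mind, assuming the same hypothetical `is_harmful(text) -> bool` filter plus an `llm(prompt) -> str` generation function (both names are placeholders for illustration only):

```python
def answer_with_output_check(prompt: str, llm, is_harmful) -> str:
    """Generate a response and return it only if the filter considers it safe."""
    response = llm(prompt)
    if is_harmful(response):
        return "Sorry, I can't help with that."  # refuse if the output is flagged
    return response
```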