GraySwanAI / nanoGCG

A fast + lightweight implementation of the GCG algorithm in PyTorch
MIT License
105 stars 27 forks

About the problems of the buffer #15

Open pipiPdesu opened 1 month ago

pipiPdesu commented 1 month ago

I found some things wrong with the buffer that prevent it from working as intended.

  1. The buffer doesn't sort until it's full. This means the buffer ignores init_buffer_losses: it directly selects the first candidate, optimizes it, and puts it back, and it also replaces the last candidate while ignoring its loss. However, this only occurs in the first step. I managed to fix it by sorting every time something is added.

  2. I tried to fix the condensing problem for the init buffer. I directly select 20 letters and concatenate them with spaces. This approach works for InternLM but still fails with the Llama3 tokenizer. I noticed that '!', '?', and '.' are more likely to be condensed, so I removed these three characters, and then it works properly with the Llama3 tokenizer XD. It should be noted that this may reduce attack performance, and I have only tested on these two models' tokenizers.

  3. I'm quite confused by the buffer strategy. As mentioned here, it is no different from simply recording the best loss and reverting to it when the suffix fails to obtain a better loss. I've noticed that neither ACG nor QCG mentions the pop operation, so I'm quite curious how it is supposed to work.
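For reference, the fix for point 1 could be sketched like this (an illustrative stand-in for the real buffer class, not the repo's actual code):

```python
class SortedAttackBuffer:
    """Illustrative stand-in for nanoGCG's attack buffer (names are mine)."""

    def __init__(self, size):
        self.size = size
        self.buffer = []  # list of (loss, ids) pairs

    def add(self, loss, ids):
        if len(self.buffer) < self.size:
            self.buffer.append((loss, ids))
        else:
            # replace the current worst entry, not a fixed slot
            self.buffer[-1] = (loss, ids)
        # sort on every add, so the best entry is correct
        # even before the buffer fills
        self.buffer.sort(key=lambda x: x[0])

    def get_best_ids(self):
        return self.buffer[0][1]

    def get_highest_loss(self):
        return self.buffer[-1][0]
```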

Do you have any ideas about 2 and 3? Thanks for your encouragement!
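The point-2 workaround, sketched as a hypothetical init builder (the encode-and-count check is the key part; all names here are mine, not nanoGCG's):

```python
import random
import string


def build_stable_init(n_tokens, tokenizer, banned=("!", "?", ".")):
    """Hypothetical init builder: sample single characters, join them with
    spaces, and keep resampling until each position maps to exactly one
    token, i.e. the tokenizer does not condense adjacent pieces."""
    pool = [c for c in string.ascii_letters + string.punctuation
            if c not in banned]
    while True:
        candidate = " ".join(random.choices(pool, k=n_tokens))
        ids = tokenizer.encode(candidate, add_special_tokens=False)
        if len(ids) == n_tokens:  # one token per optimizable position
            return candidate
```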

justinwangx commented 1 month ago

sorry for the delay on my end - thank you for digging into this!

  1. do you mean that removing '!' '?' '.' from the initialization fixes the issue without concatenating letters with a space for Llama3? if so, we can just do this. even if these characters allow for slightly better optimizations, everything is lost if the optimized sequence gets condensed to something entirely different.

  2. do you mean I-GCG? i'm not aware of a QCG. it looks like ACG does mention the pop operation. the ACG blog mentions that their algorithm tends to explore the same candidate in the buffer, which is the same problem we're having. but interestingly, they report that buffer_size=16 still allows for exploration. do you want to try experimenting with randomly selecting an attack from the buffer at each iteration rather than choosing the one with the lowest loss? this will definitely allow for exploration, and if we limit batch size i'm curious if we can retain performance.
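something like this, maybe (a hypothetical helper, assuming the buffer holds (loss, ids) pairs):

```python
import math
import random


def sample_candidate(buffer, temperature=0.0):
    """Pick an entry from the buffer. temperature == 0 reproduces the
    current greedy choice (lowest loss); larger values move toward
    uniform random selection via a softmax over negative losses."""
    if temperature == 0.0:
        return min(buffer, key=lambda x: x[0])
    weights = [math.exp(-loss / temperature) for loss, _ in buffer]
    return random.choices(buffer, weights=weights, k=1)[0]
```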

would be great if you can pr for 1, either now or after we've resolved 2 and 3

pipiPdesu commented 1 month ago

I've made PRs for 1 and 2. For 3, my bad for the confusion on QCG: I was referring to GCQ, which appears in the references (I'll check out I-GCG later).

As for the buffer: if we only ever optimize the minimum candidate and never touch the others, a min-heap is no different from simply recording the minimum value, so the size of the buffer has no impact. Randomly selecting a candidate is a good idea and I'd like to experiment with it.
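A toy illustration of that point (synthetic losses, not real attack data): under a pop-min / push-back-the-better policy, the rest of the heap never influences which candidate is used, so a full buffer and a single scalar track identically.

```python
import heapq


def select_with_heap(step_losses):
    """Buffer as a min-heap, but each step only pops the min, tries one
    new candidate, and pushes back whichever is better."""
    heap = [10.0, 11.0, 12.0]  # toy initial buffer losses
    heapq.heapify(heap)
    chosen = []
    for new_loss in step_losses:
        best = heapq.heappop(heap)
        heapq.heappush(heap, min(best, new_loss))
        chosen.append(min(best, new_loss))
    return chosen


def select_with_scalar(step_losses):
    """Same policy with no buffer at all: track the single best loss."""
    best = 10.0  # min of the same initial buffer
    chosen = []
    for new_loss in step_losses:
        best = min(best, new_loss)
        chosen.append(best)
    return chosen
```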

By the way, do you have any baseline data for the original GCG and nanoGCG? I've run some tests on AdvBench using nanoGCG's default parameters and noticed that its performance isn't as impressive as I had anticipated. I believe that even with num_steps limited to 250, nanoGCG should perform better.

justinwangx commented 1 month ago

ah, i see -- curious to see how those experiments go. the buffer update in GCQ would be different, i don't think we want to introduce a proxy model for computing proxy losses...

what metric are you using to judge performance? there shouldn't be a drastic difference in ASR. on llama-3, nanoGCG with the default settings does an iteration in ~2 seconds, while the HarmBench implementation takes ~3 seconds with the same settings (on an 80GB A100). the llm-attacks implementation is close to the HarmBench implementation in terms of speed, iirc

pipiPdesu commented 4 weeks ago

Thanks for your reply!

Apologies for the ambiguity. Regarding the buffer, what I was discussing is its implementation: as described above, the current buffer mechanism renders the buffer size somewhat meaningless.

As for performance, I mean ASR: I evaluated the first 14 harmful behaviors from AdvBench on llama2-7b-chat-hf with a system prompt added, and found that only 2 reached early stopping within 250 steps. You can refer to my experimental code, logs, and results here. I don't think this aligns with GCG's originally reported performance, so I'm curious whether there's an issue with my experimental setup or whether something else went wrong.

After resolving this issue, we could use the early-stopping step count as the metric for evaluating the effectiveness of various new features.

justinwangx commented 3 weeks ago

yes, random selection from the buffer should make buffer size meaningful.

i don't think this is a good sanity-check, actually. llama is more difficult to jailbreak -- i recommend running 500 steps by default, with early stopping. note that, since nanoGCG just runs GCG by default, there shouldn't be a difference in jailbreaking capability between default nanoGCG and the original llm-attacks implementation (the relevant performance metric that differs is speed, e.g. seconds per GCG iteration). i wouldn't be surprised if the llm-attacks implementation also gets 2 / 14 to the same early stop loss when running for 250 steps.

pipiPdesu commented 3 weeks ago

I reran the experiment with 1000 steps and found that only 3 out of 15 reached early stopping; the log is here.

I think there might be an issue somewhere, maybe related to #20 or #22. I also found some issues with HarmBench, so I think segment embedding might not be a reasonable approach. I plan to run some ablation experiments to identify the issue.

justinwangx commented 9 hours ago

did you check the model generations? my hunch is that there will be more jailbreaks despite not reaching early stopping (we have a strong early stopping requirement)