Open zhongjian-zhang opened 1 month ago
Hello, if the defense method involves non-differentiable modules, such as large language models, do you have any suggestions for designing the attack?

The forward pass through the LLM should actually be differentiable, right? However, differentiating through the LLM may of course yield gradients that are too noisy to be usable. If that is the case, you need to come up with a differentiable surrogate function that replaces the LLM and is faithful to it in the vicinity of the specimen.
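A minimal sketch of the surrogate idea above, under the assumption that the non-differentiable module can be queried as a black-box scorer: fit a local linear surrogate to its outputs around the specimen and use the surrogate's gradient for an attack step. Here `black_box` is a hypothetical stand-in for the LLM-based defense, and the sampling radius, sample count, and step size are illustrative choices, not values from any paper.

```python
import numpy as np

def local_linear_surrogate_grad(f, x, n_samples=200, sigma=0.05, seed=0):
    """Estimate a gradient of a black-box scorer f near x by fitting a
    local linear surrogate (least squares) to sampled perturbations."""
    rng = np.random.default_rng(seed)
    deltas = rng.normal(scale=sigma, size=(n_samples, x.size))
    ys = np.array([f(x + d) for d in deltas])
    # Fit y ≈ f(x) + g·d, i.e. solve g = argmin ||deltas @ g - (ys - f(x))||
    g, *_ = np.linalg.lstsq(deltas, ys - f(x), rcond=None)
    return g

# Toy stand-in for a non-differentiable module: a quantized quadratic score
# whose step-wise output gives no usable gradient through autograd.
def black_box(x):
    return np.round(10 * np.sum(x**2)) / 10

x = np.array([0.5, -0.3])
g = local_linear_surrogate_grad(black_box, x)
x_adv = x - 0.1 * np.sign(g)  # one FGSM-style step to lower the score
```

The surrogate is only trusted in the vicinity of `x`, so in practice one would refit it after each attack step; for an actual LLM defense the scorer would be the model's (possibly discretized) output on the perturbed input.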