VRehnberg opened this issue 11 months ago
There are two things I've been trying to find: reported ASR for detection methods, and how well cited the different methods are.
The best overview of performance I've found is this table:
From this ASSET paper. I'm unsure how representative it is.
The top 3 according to that paper are:
Seemingly good methods according to that table:
BackdoorBox looks a lot more promising than BackdoorBench and might just be wrappable.
I do think it would be nice to have a method that works against WaNets, which apparently rules out quite a few.
> BackdoorBox looks a lot more promising than BackdoorBench and might just be wrappable.
Wrapping it would be nice if it doesn't require too many hacks and workarounds. Though looking at the commit history, it doesn't seem like they're adding methods very actively, so if we only want one or two of the methods they have implemented, I'm not sure whether wrapping is easier than reimplementing them. (I haven't looked at their code; let me know if it would be useful for me to get a sense of how hard it would be to integrate!)
ASSET is interesting even apart from performance. Summary of the idea from their paper:
> The key idea of our approach is to induce different model behaviors between poisoned samples and clean ones. To achieve this, we design a two-step optimization process: we first minimize some loss on the clean base set; then, we attempt to offset the effect of the first minimization on the clean distribution by maximizing the same loss on the entire training set including both clean and poisoned samples. The outcome of this two-step process is a model which returns high loss for poisoned samples and low loss for clean ones.
This sounds very similar to what Paul has written about here, and I've experimented a bit with something like this in an abstraction setting (except that this is more like the "finetuning-version" of that idea). Unlike normal finetuning, this could also work for detecting adversarial examples or broken spurious correlations.
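To make the idea concrete, here's a rough PyTorch-style sketch of the two-step offset loop as I understand it from the quote (not their actual code; `clean_base_loader` and `mixed_loader` are placeholder data loaders):

```python
import torch

def two_step_offset(model, clean_base_loader, mixed_loader, optimizer, device="cpu"):
    """One round of the two-step offset idea (rough sketch, not the ASSET code):
    minimize the loss on the small trusted clean base set, then maximize the same
    loss on the mixed (clean + possibly poisoned) training set."""
    criterion = torch.nn.CrossEntropyLoss()
    model.train()

    # Step 1: minimize loss on the trusted clean base set.
    for x, y in clean_base_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

    # Step 2: offset -- maximize the same loss on the whole (possibly poisoned) set.
    for x, y in mixed_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        (-criterion(model(x), y)).backward()  # gradient ascent on the loss
        optimizer.step()

def per_sample_losses(model, loader, device="cpu"):
    """Score each sample by its loss after some offset rounds; high loss ~ suspicious."""
    criterion = torch.nn.CrossEntropyLoss(reduction="none")
    model.eval()
    scores = []
    with torch.no_grad():
        for x, y in loader:
            scores.append(criterion(model(x.to(device)), y.to(device)).cpu())
    return torch.cat(scores)
```

After a few such rounds, thresholding `per_sample_losses` on the mixed set would (if the idea works as advertised) separate poisoned from clean samples.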
Particularly interesting is this part:
> The effect of the second maximization significantly outweighs that of the first minimization; as a result, both poisoned and clean samples achieve large losses and become inseparable. To tackle the challenge, we propose a strengthened technique that involves two nested offset procedures, and the inner offset reinforces the outer one. Specifically, we use the inner offset procedure to identify the points most likely to be poisoned and mark them as suspicious; the outer offset procedure still minimizes some loss on the clean base set, but the maximization will now be performed on the points marked to be suspicious by the inner offset, instead of the entire poisoned dataset.
This was sometimes a problem in practice for me too; might be worth trying their approach. It might not be the easiest to implement and could be fiddly to get working, though, so overall I'm not sure how much of a priority this should be---if we can get faster results from other methods, that's also a consideration.
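For what it's worth, here is how I'd naively wire the nested variant together, building on the sketch above (the selection knob `suspicious_frac` is made up, not a number from the paper):

```python
import torch
from torch.utils.data import DataLoader, Subset

def nested_offset_round(model, clean_base_loader, mixed_dataset, optimizer,
                        suspicious_frac=0.05, batch_size=128, device="cpu"):
    """Rough sketch of the nested variant: an inner pass scores all points by loss
    and marks the highest-loss fraction as suspicious; the outer offset then
    minimizes on the clean base set but maximizes only on the suspicious subset.
    Uses two_step_offset and per_sample_losses from the sketch above."""
    mixed_loader = DataLoader(mixed_dataset, batch_size=batch_size, shuffle=False)

    # Inner offset: score every training point and keep the most suspicious ones.
    scores = per_sample_losses(model, mixed_loader, device)
    k = max(1, int(suspicious_frac * len(scores)))
    suspicious_idx = torch.topk(scores, k).indices.tolist()
    suspicious_loader = DataLoader(Subset(mixed_dataset, suspicious_idx),
                                   batch_size=batch_size, shuffle=True)

    # Outer offset: minimize on the clean base set, maximize only on suspicious points.
    two_step_offset(model, clean_base_loader, suspicious_loader, optimizer, device)
    return suspicious_idx
```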
> I do think it would be nice to have a method that works against WaNets, which apparently rules out quite a few.
Remaining would be:
> Wrapping it would be nice if it doesn't require too many hacks and workarounds. Though looking at the commit history, it doesn't seem like they're adding methods very actively, so if we only want one or two of the methods they have implemented, I'm not sure whether wrapping is easier than reimplementing them. (I haven't looked at their code; let me know if it would be useful for me to get a sense of how hard it would be to integrate!)
I don't think wrapping would be easier. Wrapping could be useful for having one less separate standard, but let's skip that for now then.
> ASSET is interesting even apart from performance. Summary of the idea from their paper:
> The key idea of our approach is to induce different model behaviors between poisoned samples and clean ones. To achieve this, we design a two-step optimization process: we first minimize some loss on the clean base set; then, we attempt to offset the effect of the first minimization on the clean distribution by maximizing the same loss on the entire training set including both clean and poisoned samples. The outcome of this two-step process is a model which returns high loss for poisoned samples and low loss for clean ones.
> This sounds very similar to what Paul has written about here, and I've experimented a bit with something like this in an abstraction setting (except that this is more like the "finetuning-version" of that idea). Unlike normal finetuning, this could also work for detecting adversarial examples or broken spurious correlations.
> Particularly interesting is this part:
> The effect of the second maximization significantly outweighs that of the first minimization; as a result, both poisoned and clean samples achieve large losses and become inseparable. To tackle the challenge, we propose a strengthened technique that involves two nested offset procedures, and the inner offset reinforces the outer one. Specifically, we use the inner offset procedure to identify the points most likely to be poisoned and mark them as suspicious; the outer offset procedure still minimizes some loss on the clean base set, but the maximization will now be performed on the points marked to be suspicious by the inner offset, instead of the entire poisoned dataset.
> This was sometimes a problem in practice for me too; might be worth trying their approach. It might not be the easiest to implement and could be fiddly to get working, though, so overall I'm not sure how much of a priority this should be---if we can get faster results from other methods, that's also a consideration.
I'm a bit sceptical of methods that need the original poisoned training data, in addition to the clean dataset, to train the detector. The main exception is if the detector generalizes to attacks other than the ones it was trained on (e.g. if trained on backdoors, it can detect adversarial attacks, or vice versa).
But still, it might be worth investigating.
I haven't found a good survey comparing methods yet; here are some of the techniques I've found.
Some surveys, in any case, are:
I would roughly split these into three groups:
- Using sensitivity to input augmentation
- Using statistics of activations (learnt features)
- Adversarial training to increase robustness
I'll probably start with Spectral/Spectre and then perhaps MagNet after that. They seem easy enough, though I haven't looked at the details yet.
ASSET requires some more work, so I'll hold off on that for now.
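For reference, the core of the spectral-signatures-style detector is small enough to sketch here; this is my reading of the general recipe (per-class SVD of penultimate-layer activations), with `activations` as a placeholder for however we end up extracting features:

```python
import numpy as np

def spectral_scores(activations):
    """Spectral-signatures-style outlier score for one class: center the
    (penultimate-layer) activations and score each sample by its squared
    projection onto the top singular vector. High scores ~ likely poisoned."""
    acts = np.asarray(activations, dtype=np.float64)        # shape (n_samples, dim)
    centered = acts - acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]                                    # top right singular vector
    return (centered @ top_direction) ** 2

def flag_suspicious(activations, removal_frac=0.05):
    """Flag the highest-scoring fraction of samples within a class; `removal_frac`
    would normally be tied to an assumed upper bound on the poison rate."""
    scores = spectral_scores(activations)
    k = max(1, int(removal_frac * len(scores)))
    return np.argsort(scores)[-k:], scores
```

As I understand it, Spectre refines this with more robust statistics, so starting from the plain spectral version seems natural.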
> I'm a bit sceptical of methods that need the original poisoned training data, in addition to the clean dataset, to train the detector.
I think such methods can still be interesting if individual failures aren't too bad, and we mainly want to avoid consistently making mistakes in the long run (what Paul calls the "low-stakes" setting). In that case, we can use deployment inputs as the mixed set of clean+poisoned---early on, we won't be able to distinguish them yet, but at some point we'll have seen enough poisoned data. A potential problem is if the poisoning rate is too low, in which case we might keep making mistakes forever with most detection methods. But I think even that is fine in many applications. For example, if deceptive AIs can only take treacherous actions in 1% of cases, that makes me more optimistic about preventing takeover (though we'd still have to prevent certain forms of coordination between AIs). Ironically, I'm less sure how helpful this is in the backdoor detection case---intuitively it does feel like you'll only see a very low poison rate during deployment for many applications, and it's still bad. But I'm sure there are deployment scenarios where you're only worried about high rates of attempted attacks.
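As a back-of-the-envelope check on that worry (all numbers made up):

```python
def mistakes_before_detection(poison_rate, poisoned_samples_needed):
    """Very rough heuristic: if the detector needs roughly `poisoned_samples_needed`
    poisoned examples before it can separate them, we expect about that many
    undetected attacks, spread over ~poisoned_samples_needed / poison_rate
    deployment inputs."""
    return poisoned_samples_needed, poisoned_samples_needed / poison_rate

# e.g. a 1% attack rate and a detector that needs ~100 poisoned samples:
# roughly 100 mistakes over roughly 10,000 deployment inputs before detection works.
print(mistakes_before_detection(0.01, 100))
```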
> I would roughly split these into three groups:
> - Using sensitivity to input augmentation
> - Using statistics of activations (learnt features)
> - Adversarial training to increase robustness
I think a guiding principle for choosing which ones to implement is whether they're specific to e.g. image adversarial attacks, or could also be applied to many other tasks in our benchmark. Input augmentations will often be pretty specific, I assume, but generic methods like MagNet seem promising. Activation statistics could be pretty general---my sense is that they are often designed specifically with adversarial examples in mind, but it could still be interesting to try them, especially ones that seem reasonable a priori for other tasks.
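To illustrate why MagNet-style detection seems generic: the detector half is just reconstruction error under an autoencoder trained on clean data, which doesn't care what the inputs are. A minimal sketch (the tiny MLP autoencoder over flattened inputs is a placeholder assumption, not MagNet's actual architecture):

```python
import torch
import torch.nn as nn

class SmallAutoencoder(nn.Module):
    """Placeholder autoencoder over flattened feature vectors; anything trained to
    reconstruct clean data would play the same role."""
    def __init__(self, dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 16))
        self.decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_scores(autoencoder, x):
    """Anomaly score = per-sample reconstruction error. Since the autoencoder is
    trained only on clean data, off-distribution (e.g. adversarial or poisoned)
    inputs should reconstruct badly; thresholding this score gives the detector."""
    autoencoder.eval()
    with torch.no_grad():
        recon = autoencoder(x)
    return ((recon - x) ** 2).flatten(1).mean(dim=1)
```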
I think we might care less about adversarial training, since it's unclear how you'd apply that to cases like preventing measurement tampering. (And on the other hand, for deception, the main problem would be that finding the adversarial inputs might be very hard, which is a pretty separate topic from what we're doing.)
> I'll probably start with Spectral/Spectre and then perhaps MagNet after that. They seem easy enough, though I haven't looked at the details yet.
> ASSET requires some more work, so I'll hold off on that for now.
Sounds good, I agree with the decision to start with easier-to-implement methods.
Looking at what collections of results there are to compare against (besides the original papers), here is what the three biggest sources I've found cover.
BackdoorBench combinations:
ASSET table:
Confusion Training table:
Looking at this, it seems like what we have covered now and can compare against are:
Which seems a bit limited. On the other hand, these papers do not seem to run into memory issues for statistical detectors, so we should be able to get numbers for those as well.
We currently have 3 detectors. In this issue I will investigate some possible new additions.
Top candidates: