VRehnberg opened this issue 11 months ago
There are two things I've been trying to find: reported ASR for detection methods, and how well cited the different methods are.
The best overview of performance I've found is this table:
From this ASSET paper. I'm unsure how representative it is.
The top 3 according to that paper are:
Seemingly good methods according to that table:
BackdoorBox looks a lot more promising than BackdoorBench and might just be wrappable.
I do think it would be nice to have a method that works against WaNets, which apparently rules out quite a few.
> BackdoorBox looks a lot more promising than BackdoorBench and might just be wrappable.
Wrapping it would be nice if it doesn't require too many hacks and workarounds. Though looking at the commit history, it doesn't seem like they're adding methods very actively, so if we only want one or two of the methods they have implemented, I'm not sure whether wrapping is easier than reimplementing them. (I haven't looked at their code; let me know if it would be useful for me to get a sense of how hard it would be to integrate!)
ASSET is interesting even apart from performance. Summary of the idea from their paper:
> The key idea of our approach is to induce different model behaviors between poisoned samples and clean ones. To achieve this, we design a two-step optimization process: we first minimize some loss on the clean base set; then, we attempt to offset the effect of the first minimization on the clean distribution by maximizing the same loss on the entire training set including both clean and poisoned samples. The outcome of this two-step process is a model which returns high loss for poisoned samples and low loss for clean ones.
This sounds very similar to what Paul has written about here, and I've experimented a bit with something like this in an abstraction setting (except that this is more like the "finetuning-version" of that idea). Unlike normal finetuning, this could also work for detecting adversarial examples or broken spurious correlations.
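To make the idea concrete, here's a rough PyTorch-style sketch of the two-step offset loop as I understand it from the quote (not their actual code; `clean_base_loader` and `mixed_loader` are placeholder data loaders):

```python
import torch

def two_step_offset(model, clean_base_loader, mixed_loader, optimizer, device="cpu"):
    """One round of the two-step offset idea (rough sketch, not the ASSET code):
    minimize the loss on the small trusted clean base set, then maximize the same
    loss on the mixed (clean + possibly poisoned) training set."""
    criterion = torch.nn.CrossEntropyLoss()
    model.train()

    # Step 1: minimize loss on the trusted clean base set.
    for x, y in clean_base_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

    # Step 2: offset -- maximize the same loss on the whole (possibly poisoned) set.
    for x, y in mixed_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        (-criterion(model(x), y)).backward()  # gradient ascent on the loss
        optimizer.step()

def per_sample_losses(model, loader, device="cpu"):
    """Score each sample by its loss after some offset rounds; high loss ~ suspicious."""
    criterion = torch.nn.CrossEntropyLoss(reduction="none")
    model.eval()
    scores = []
    with torch.no_grad():
        for x, y in loader:
            scores.append(criterion(model(x.to(device)), y.to(device)).cpu())
    return torch.cat(scores)
```

After a few such rounds, thresholding `per_sample_losses` on the mixed set would (if the idea works as advertised) separate poisoned from clean samples.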
Particularly interesting is this part:
> The effect of the second maximization significantly outweighs that of the first minimization; as a result, both poisoned and clean samples achieve large losses and become inseparable. To tackle the challenge, we propose a strengthened technique that involves two nested offset procedures, and the inner offset reinforces the outer one. Specifically, we use the inner offset procedure to identify the points most likely to be poisoned and mark them as suspicious; the outer offset procedure still minimizes some loss on the clean base set, but the maximization will now be performed on the points marked to be suspicious by the inner offset, instead of the entire poisoned dataset.
This was sometimes a problem in practice for me too; might be worth trying their approach. It might not be the easiest to implement and could be fiddly to get working, though, so overall I'm not sure how much of a priority this should be---if we can get faster results from other methods, that's also a consideration.
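For what it's worth, here is how I'd naively wire the nested variant together, building on the sketch above (the selection knob `suspicious_frac` is made up, not a number from the paper):

```python
import torch
from torch.utils.data import DataLoader, Subset

def nested_offset_round(model, clean_base_loader, mixed_dataset, optimizer,
                        suspicious_frac=0.05, batch_size=128, device="cpu"):
    """Rough sketch of the nested variant: an inner pass scores all points by loss
    and marks the highest-loss fraction as suspicious; the outer offset then
    minimizes on the clean base set but maximizes only on the suspicious subset.
    Uses two_step_offset and per_sample_losses from the sketch above."""
    mixed_loader = DataLoader(mixed_dataset, batch_size=batch_size, shuffle=False)

    # Inner offset: score every training point and keep the most suspicious ones.
    scores = per_sample_losses(model, mixed_loader, device)
    k = max(1, int(suspicious_frac * len(scores)))
    suspicious_idx = torch.topk(scores, k).indices.tolist()
    suspicious_loader = DataLoader(Subset(mixed_dataset, suspicious_idx),
                                   batch_size=batch_size, shuffle=True)

    # Outer offset: minimize on the clean base set, maximize only on suspicious points.
    two_step_offset(model, clean_base_loader, suspicious_loader, optimizer, device)
    return suspicious_idx
```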
> I do think it would be nice to have a method that works against WaNets, which apparently rules out quite a few.
Remaining would be:
> Wrapping it would be nice if it doesn't require too many hacks and workarounds. Though looking at the commit history, it doesn't seem like they're adding methods very actively, so if we only want one or two of the methods they have implemented, I'm not sure whether wrapping is easier than reimplementing them. (I haven't looked at their code; let me know if it would be useful for me to get a sense of how hard it would be to integrate!)
I don't think wrapping would be easier. Wrapping could be useful for having one less separate standard, but let's skip that for now then.
> ASSET is interesting even apart from performance. Summary of the idea from their paper:
> The key idea of our approach is to induce different model behaviors between poisoned samples and clean ones. To achieve this, we design a two-step optimization process: we first minimize some loss on the clean base set; then, we attempt to offset the effect of the first minimization on the clean distribution by maximizing the same loss on the entire training set including both clean and poisoned samples. The outcome of this two-step process is a model which returns high loss for poisoned samples and low loss for clean ones.
> This sounds very similar to what Paul has written about here, and I've experimented a bit with something like this in an abstraction setting (except that this is more like the "finetuning-version" of that idea). Unlike normal finetuning, this could also work for detecting adversarial examples or broken spurious correlations.
> Particularly interesting is this part:
> The effect of the second maximization significantly outweighs that of the first minimization; as a result, both poisoned and clean samples achieve large losses and become inseparable. To tackle the challenge, we propose a strengthened technique that involves two nested offset procedures, and the inner offset reinforces the outer one. Specifically, we use the inner offset procedure to identify the points most likely to be poisoned and mark them as suspicious; the outer offset procedure still minimizes some loss on the clean base set, but the maximization will now be performed on the points marked to be suspicious by the inner offset, instead of the entire poisoned dataset.
> This was sometimes a problem in practice for me too; might be worth trying their approach. It might not be the easiest to implement and could be fiddly to get working, though, so overall I'm not sure how much of a priority this should be---if we can get faster results from other methods, that's also a consideration.
I'm a bit sceptical of methods that need the original poisoned training data, in addition to the clean dataset, to train the detector. The main exception is if the detector generalizes to attacks other than the ones it was trained on (e.g. if trained on backdoors, it can detect adversarial attacks, or vice versa).
But still, it might be worth investigating.
I haven't found a good survey comparing methods yet; here are some of the techniques I've found.
Some surveys, in any case, are:
I would roughly split these into three groups:
- Using sensitivity to input augmentation
- Using statistics of activations (learnt features)
- Adversarial training to increase robustness
I'll probably start with Spectral/Spectre and then perhaps MagNet after that. They seem easy enough, though I haven't looked at the details yet.
ASSET requires some more work, so I'll hold off on that for now.
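For reference, the core of the spectral-signatures-style detector is small enough to sketch here; this is my reading of the general recipe (per-class SVD of penultimate-layer activations), with `activations` as a placeholder for however we end up extracting features:

```python
import numpy as np

def spectral_scores(activations):
    """Spectral-signatures-style outlier score for one class: center the
    (penultimate-layer) activations and score each sample by its squared
    projection onto the top singular vector. High scores ~ likely poisoned."""
    acts = np.asarray(activations, dtype=np.float64)        # shape (n_samples, dim)
    centered = acts - acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]                                    # top right singular vector
    return (centered @ top_direction) ** 2

def flag_suspicious(activations, removal_frac=0.05):
    """Flag the highest-scoring fraction of samples within a class; `removal_frac`
    would normally be tied to an assumed upper bound on the poison rate."""
    scores = spectral_scores(activations)
    k = max(1, int(removal_frac * len(scores)))
    return np.argsort(scores)[-k:], scores
```

As I understand it, Spectre refines this with more robust statistics, so starting from the plain spectral version seems natural.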
> I'm a bit sceptical of methods that need the original poisoned training data, in addition to the clean dataset, to train the detector.
I think such methods can still be interesting if individual failures aren't too bad, and we mainly want to avoid consistently making mistakes in the long run (what Paul calls the "low-stakes" setting). In that case, we can use deployment inputs as the mixed set of clean+poisoned---early on, we won't be able to distinguish them yet, but at some point we'll have seen enough poisoned data. A potential problem is if the poisoning rate is too low, in which case we might keep making mistakes forever with most detection methods. But I think even that is fine in many applications. For example, if deceptive AIs can only take treacherous actions in 1% of cases, that makes me more optimistic about preventing takeover (though we'd still have to prevent certain forms of coordination between AIs). Ironically, I'm less sure how helpful this is in the backdoor detection case---intuitively it does feel like you'll only see a very low poison rate during deployment for many applications, and it's still bad. But I'm sure there are deployment scenarios where you're only worried about high rates of attempted attacks.
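As a back-of-the-envelope check on that worry (all numbers made up):

```python
def mistakes_before_detection(poison_rate, poisoned_samples_needed):
    """Very rough heuristic: if the detector needs roughly `poisoned_samples_needed`
    poisoned examples before it can separate them, we expect about that many
    undetected attacks, spread over ~poisoned_samples_needed / poison_rate
    deployment inputs."""
    return poisoned_samples_needed, poisoned_samples_needed / poison_rate

# e.g. a 1% attack rate and a detector that needs ~100 poisoned samples:
# roughly 100 mistakes over roughly 10,000 deployment inputs before detection works.
print(mistakes_before_detection(0.01, 100))
```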
> I would roughly split these into three groups:
> - Using sensitivity to input augmentation
> - Using statistics of activations (learnt features)
> - Adversarial training to increase robustness
I think a guiding principle for choosing which ones to implement is whether they're specific to e.g. image adversarial attacks, or could also be applied to many other tasks in our benchmark. Input augmentations will often be pretty specific, I assume, but generic methods like MagNet seem promising. Activation statistics could be pretty general---my sense is that they are often designed specifically with adversarial examples in mind, but it could still be interesting to try them, especially ones that seem reasonable a priori for other tasks.
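To illustrate why MagNet-style detection seems generic: the detector half is just reconstruction error under an autoencoder trained on clean data, which doesn't care what the inputs are. A minimal sketch (the tiny MLP autoencoder over flattened inputs is a placeholder assumption, not MagNet's actual architecture):

```python
import torch
import torch.nn as nn

class SmallAutoencoder(nn.Module):
    """Placeholder autoencoder over flattened feature vectors; anything trained to
    reconstruct clean data would play the same role."""
    def __init__(self, dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 16))
        self.decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_scores(autoencoder, x):
    """Anomaly score = per-sample reconstruction error. Since the autoencoder is
    trained only on clean data, off-distribution (e.g. adversarial or poisoned)
    inputs should reconstruct badly; thresholding this score gives the detector."""
    autoencoder.eval()
    with torch.no_grad():
        recon = autoencoder(x)
    return ((recon - x) ** 2).flatten(1).mean(dim=1)
```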
I think we might care less about adversarial training, since it's unclear how you'd apply that to cases like preventing measurement tampering. (And on the other hand, for deception, the main problem would be that finding the adversarial inputs might be very hard, which is a pretty separate topic from what we're doing.)
> I'll probably start with Spectral/Spectre and then perhaps MagNet after that. They seem easy enough, though I haven't looked at the details yet.
> ASSET requires some more work, so I'll hold off on that for now.
Sounds good, I agree with the decision to start with easier-to-implement methods.
Looking at what collections of results there are to compare against (besides the original papers), here is what the three biggest sources I've found cover.
BackdoorBench combinations:
ASSET table:
Confusion Training table:
Looking at this, it seems like what we have covered now and can compare against are:
Which seems a bit limited. On the other hand, these papers do not seem to run into memory issues for statistical detectors, so we should be able to get numbers for those as well.
We currently have 3 detectors. In this issue I will investigate some possible new additions.
Top candidates: