Sleeper Agent get_poison_indices() may return stale array

swsuggs commented 1 year ago

Describe the bug

In Sleeper Agent, get_poison_indices() returns self.indices_poison; however, the poisoning procedure never updates this variable to the indices that were actually used in the final result.

Details

This attack makes max_trials attempts to poison the data. Each time, a new self.indices_poison is sampled (L201-208). If this attempt is the most successful so far, then best_indices_poison is set to self.indices_poison (L231). But at the end of all the trials, self.indices_poison still points to the indices from the latest attempt, even if the best attempt was earlier.
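A minimal sketch of the pattern described above, using a toy class rather than ART's actual implementation. The class name, the scripted per-trial scores, the "lower score is better" convention, and the returned history are all assumptions made for illustration; only max_trials, indices_poison, best_indices_poison, and get_poison_indices() come from the description.

```python
import random


class ToySleeperAgent:
    """Hypothetical stand-in for the attack; NOT the ART class."""

    def __init__(self, max_trials, scores, n_samples=100, n_poison=5, seed=0):
        self.max_trials = max_trials
        self.scores = scores        # assumed: one scripted score per trial, lower is better
        self.n_samples = n_samples
        self.n_poison = n_poison
        self.rng = random.Random(seed)
        self.indices_poison = []

    def poison(self):
        best_score = float("inf")
        best_indices_poison = []
        history = []                # kept only so the example can be inspected
        for trial in range(self.max_trials):
            # A fresh set of indices is sampled on every trial.
            self.indices_poison = sorted(
                self.rng.sample(range(self.n_samples), self.n_poison)
            )
            history.append(list(self.indices_poison))
            if self.scores[trial] < best_score:
                best_score = self.scores[trial]
                best_indices_poison = self.indices_poison
        # No write-back here: self.indices_poison still holds the last trial.
        return best_indices_poison, history

    def get_poison_indices(self):
        return self.indices_poison


agent = ToySleeperAgent(max_trials=3, scores=[0.9, 0.1, 0.5])
best, history = agent.poison()
print("best trial indices:   ", best)
print("get_poison_indices(): ", agent.get_poison_indices())
```

With these scripted scores the second trial is the best one, yet get_poison_indices() reports the indices sampled in the third trial.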

Expected behavior

At the end of the poisoning loop, after all poisoning attempts are complete (L237), set self.indices_poison = best_indices_poison.
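The proposed change could look roughly like the sketch below. It is again a toy class, not the ART source: the pre-drawn candidates list, the scores, and the "lower is better" comparison are hypothetical, and the only line that mirrors the suggestion is the final assignment after the loop.

```python
class ToyAttackFixed:
    """Hypothetical stand-in showing the write-back after the trial loop."""

    def __init__(self, candidates, scores):
        self.candidates = candidates  # assumed: one pre-drawn index list per trial
        self.scores = scores          # assumed: lower score means a better trial
        self.indices_poison = []

    def poison(self):
        best_score = float("inf")
        best_indices_poison = []
        for indices, score in zip(self.candidates, self.scores):
            self.indices_poison = indices
            if score < best_score:
                best_score = score
                best_indices_poison = self.indices_poison
        # Proposed fix: expose the indices of the best trial, not the last one.
        self.indices_poison = best_indices_poison

    def get_poison_indices(self):
        return self.indices_poison


attack = ToyAttackFixed(
    candidates=[[1, 2], [3, 4], [5, 6]],
    scores=[0.9, 0.1, 0.5],
)
attack.poison()
print(attack.get_poison_indices())
```

Here the second trial has the lowest score, so get_poison_indices() returns its indices even though a later trial ran afterwards.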

beat-buesser commented 1 year ago

Hi @swsuggs Thank you very much for raising this issue!

@monshri What do you think?

monshri commented 1 year ago

Submitted PR with the fix discussed with Sterling - 1952

Trusted-AI / adversarial-robustness-toolbox

Sleeper Agent get_poison_indices() may return stale array #1952