Vaidehi99 / InfoDeletionAttacks

MIT License
37 stars 2 forks source link

Request of script to reproduce all experiments results in paper #5

Open pkulium opened 6 months ago

pkulium commented 6 months ago

Great work! I am writing to ask any script you used to reproduce all experimental results, since there are so many settings to make.

Looking forward to hearing from you!

pkulium commented 6 months ago

Another question about evaluate.py at line 976

if args.dummy_string: target_ids = torch.as_tensor(tok.encode(" "+request['target_true']['str'])).view(1, -1).to(device) else: target_ids = torch.as_tensor(tok.encode(" "+request['target_new']['str'])).view(1, -1).to(device)

If args.dummy_string is false, for error injection attack, why target is request['target_new']?

For example, by running Head Projection Attack with Error Injection defense as

python3 -m experiments.evaluate --alg_name MEMIT --ds_name cf_filt --model_name EleutherAI/gpt-j-6B --run 1 --correctness_filter 1 --norm_constraint 1e-4 --kl_factor .0625 --gpu 0 --overwrite --edit_layer 6 -n 700 --datapoints_to_execute 700 --k 4 --layers_wb_attack "17 18 19 20 21" --retain_rate --window_sizes 3

we have

(Pdb) request['prompt'].format(request['subject']) 'The mother tongue of Danielle Darrieux is' (Pdb) request['target_new']['str'] 'English'

but it seems that the target is supposed to be French?

Vaidehi99 commented 6 months ago

Because in error injection, we maximize the probability of incorrect answer (English in this case)

pkulium commented 6 months ago

For example, The mother tongue of Danielle Darrieux is French ---> The mother tongue of Danielle Darrieux is ***

After edit, we check topk probability of ''English'' as target in the middle layers instead of French? It seems that the definition of Attack success rate should be topk probability of ''French'' in the middle layers.

Vaidehi99 commented 6 months ago

Yeah, that's correct. It should be "French". Looks like that could be a bug. Thanks for pointing it out. I'll correct and update it.