LJY-M / Voice-Guard


Issue About Attack_utils #1

Open Yuki-yt3 opened 4 months ago

Yuki-yt3 commented 4 months ago

Thank you very much for sharing the code for this work! However, in attack_utils.py the line `from data_utils import wav2mel_tensor, Transform` fails with "cannot find reference 'Transform' in data_utils.py", and `from asr_model.find_most_unlikely_speaker import speaker_name` fails with "unresolved reference 'asr_model'". Would you please provide more code details? Thank you very much!

LJY-M commented 4 months ago

This function selects the target speaker; for the implementation we follow the least-likely-class idea from adversarial attacks.

This function is not part of the core code of this paper, so please implement it yourself.
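A minimal sketch of the least-likely-class idea applied to speaker selection: pick the enrolled speaker whose embedding is least similar to the source utterance's embedding. All names here (`find_most_unlikely_speaker`, the embedding dict) are assumptions standing in for the repo's missing `asr_model` module, not its actual code.

```python
# Hypothetical least-likely target-speaker selection.
# Assumes speaker embeddings are already computed (e.g. by a speaker
# encoder); the function names are placeholders, not from the repo.
import numpy as np

def find_most_unlikely_speaker(source_emb: np.ndarray,
                               speaker_embeddings: dict) -> str:
    """Return the enrolled speaker whose embedding has the lowest
    cosine similarity to the source embedding (the 'least likely' one)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # The least-likely "class" is the speaker the encoder rates lowest.
    return min(speaker_embeddings,
               key=lambda name: cosine(source_emb, speaker_embeddings[name]))
```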

windforestfiremountain commented 4 months ago

Thanks for your code. I'd like to ask about some details of the experiment:

(1) First, how do I start the project? What parameters and files should I consider if I use your data_utils.py and attack_utils.py to implement an attack.py like the one in the AttackVC repository? For instance, what does the parameter `init_c` in the function `mask_wav_emb_attack` look like — is it the alpha parameter in Algorithm 1?

(2) Second, I'd also like to ask about `from data_utils import wav2mel_tensor, Transform`. From the code, I guess `Transform` is related to the power spectral density (PSD), because of the output parameter `psd_transformer` — is the implementation similar to the code in generate_masking_threshold.py? Also, what exactly does `**kwargs` contain? Since `psd_transformer` serves as an input to `deal_mask_wav_emb_attack_1`, it is necessary to understand what kwargs entails.

(3) Last but not least, could you please explain the details of the binary search algorithm, as indicated in line 89 of Algorithm 1? It appears that `deal_mask_wav_emb_attack_2` implements a weighted loss instead of the masking-threshold loss in Equation 7, but I am not entirely clear on how the parameter `c` (perhaps 'alpha' in the paper) is updated in the binary search. Additionally, I'd like to ask about the meanings of the variables `attack_flag` and `false_c` — do they cover the scenarios of no attack or no masking threshold?

Many thanks again for open-sourcing the code, and looking forward to your reply!
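For context on question (2): a generic per-frame PSD computation, in the style used by masking-threshold losses in the audio adversarial-example literature (peak normalized to 96 dB). This is a sketch under assumed window/FFT sizes, not the repo's actual `Transform` class.

```python
# Generic PSD-from-STFT sketch (NOT the repo's Transform class).
# n_fft and hop are assumptions; check data_utils.py for the real values.
import numpy as np

def psd_frames(wav: np.ndarray, n_fft: int = 2048, hop: int = 512):
    """STFT magnitude -> PSD in dB, normalized so the peak maps to 96 dB."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wav) - n_fft + 1, hop):
        spec = np.fft.rfft(window * wav[start:start + n_fft])
        frames.append(np.abs(spec) ** 2)
    psd = np.maximum(np.stack(frames), 1e-20)   # avoid log(0)
    psd_db = 10.0 * np.log10(psd)
    psd_max = psd_db.max()
    # Shift so the loudest bin sits at 96 dB; return the scale as well,
    # since masking-threshold code typically needs psd_max separately.
    return 96.0 - psd_max + psd_db, psd_max
```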

LJY-M commented 4 months ago

(1) `init_c = 1.0`; `c` is alpha.
(2) The implementation of the Transform class has been submitted; `model, config, attr, device = load_model(model_dir)` and `**kwargs = **config["preprocess"]`.
(3) Please refer to Equation 8. You can implement the alpha schedule however you like; it only affects efficiency. `attack_flag` indicates whether the defense succeeded; `false_c` is the value of alpha when the defense fails.

This algorithm is no longer maintained; please implement it as you wish.
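A minimal sketch of how the binary search over `c` (alpha) could look, assuming `attack_flag` reports whether the defense succeeded and `c` weights the masking-threshold (imperceptibility) loss, so a successful defense lets us push `c` higher. `run_attack` and the bounds are placeholders, not the repo's interface.

```python
# Sketch of a binary search over the loss weight c (alpha in the paper).
# `run_attack(c)` is a placeholder: it runs the optimization at a given c
# and returns (adv_wav, attack_flag); attack_flag=True means the defense
# perturbation succeeded at this c. Bounds and step count are assumptions.
def binary_search_c(run_attack, c_lo=0.0, c_hi=16.0, steps=6):
    best_adv, false_c = None, None
    for _ in range(steps):
        c = 0.5 * (c_lo + c_hi)
        adv, attack_flag = run_attack(c)
        if attack_flag:
            # Defense succeeded: keep this result, try a larger c
            # (i.e. weight imperceptibility more heavily).
            best_adv, c_lo = adv, c
        else:
            # Defense failed: record the failing alpha and back off.
            false_c, c_hi = c, c
    return best_adv, c_lo, false_c
```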

windforestfiremountain commented 1 month ago

Thanks for the code and parameters provided. I ran the code with the following steps:

(1) Comment out the line `from asr_model.find_most_unlikely_speaker import speaker_name`, and instead predefine my own list variable `speaker_name` containing multiple speakers and a dictionary variable `speaker_embedding_dict` mapping those speakers to their representations.

(2) Using the default parameters in AttackVC and the `kwargs = config["preprocess"]` you offered, I obtained the three key variables `wav`, `theta_xs`, and `psd_max` from the original input wave `vc_tgt`, which provides the speaker information.

(3) After converting these key variables to tensors, I called the function `mask_wav_emb_attack` and got the output `final_adv_inp`, a tensor shaped just like the `wav` input.
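The workaround in step (1) could be as simple as the fragment below; every name and path here is a placeholder I made up, standing in for the repo's missing `asr_model` module.

```python
# Stand-in for the missing `asr_model.find_most_unlikely_speaker` import:
# predefine the speaker list and an embedding lookup by hand.
# Speaker IDs and paths are placeholders, not from the repo.
speaker_name = ["p225", "p226", "p227"]           # candidate target speakers
speaker_embedding_dict = {                        # path to each saved embedding
    name: f"embeddings/{name}.npy" for name in speaker_name
}
```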

After these steps, however, when I converted `final_adv_inp` to numpy and wrote it out with the soundfile library, I got a noisy wav, regardless of whether I used the 16 kHz or 24 kHz sample rate from the configs provided in AttackVC. This differs from AttackVC, where `adv_inp` sounds like the original input wave. So I'd like to ask for help based on two hypotheses:

(1) The losses in the logs below — I'm afraid `loss_emb_l2` is too high:

step 1 : 1500 || loss_emb_l2 : 5.346188545227051 || loss_th : 39.75281083223775
attack_step_1 || attack_flag : True || eps : 0.1 || step : 1500
step 2 : 700 || loss_emb_l2 : 4.801605701446533 || loss_th : 37.85937347082802
attack_step_2 || attack_flag : True || c : 8.0 || step : 750
attack_step || eps : 0.1
attack_step_2 || c : 8.0

(2) I also noticed the class InversePreEmphasis in https://github.com/LJY-M/Voice-Guard/blob/main/data_utils.py#L37. Is there any need to apply inverse pre-emphasis to `adv_inp`, given that the parameter `vc_tgt` is pre-emphasized in https://github.com/LJY-M/Voice-Guard/blob/main/attack_utils.py#L46? (Though it has no influence on the `vc_tgt` input within `deal_mask_wav_emb_attack_1` and `deal_mask_wav_emb_attack_2`.)
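If `final_adv_inp` was optimized in the pre-emphasized domain, applying the inverse filter before writing the wav should undo the high-frequency boost, which could account for part of the noisiness. A sketch of the standard filter pair (the 0.97 coefficient is the common default and an assumption here; check data_utils.py for the repo's actual value):

```python
# Pre-emphasis (FIR) and its exact inverse (IIR). The coefficient 0.97
# is an assumption, not necessarily what data_utils.py uses.
import numpy as np

def pre_emphasis(wav: np.ndarray, coef: float = 0.97) -> np.ndarray:
    out = np.copy(wav)
    out[1:] -= coef * wav[:-1]        # y[n] = x[n] - coef * x[n-1]
    return out

def inverse_pre_emphasis(wav: np.ndarray, coef: float = 0.97) -> np.ndarray:
    out = np.copy(wav)
    for n in range(1, len(out)):      # y[n] = x[n] + coef * y[n-1]
        out[n] += coef * out[n - 1]
    return out
```

The two functions are exact inverses, so `inverse_pre_emphasis(pre_emphasis(x))` recovers `x` up to float rounding.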

Looking forward to the reply, many thanks!