RaphaelOlivier / robust_speech

Apache License 2.0

Questions about targeted settings of attacking End-to-end ASR. #5

Closed LetterLiGo closed 1 year ago

LetterLiGo commented 1 year ago

May I ask a few questions about targeted settings? Many papers have studied attacks on the DeepSpeech model, but few attack ASRs built on the SpeechBrain framework. I recently tried to implement a simple demo attacking a SpeechBrain-based seq2seq ASR, and I found it much harder to produce successful attacks than on DeepSpeech, which performs character-level inference and relies on a language model to compensate for the weaknesses of its acoustic model. After much debugging, I referred to your code that leverages both the seq2seq loss (i.e., NLL loss) and the CTC loss, but the results were still too poor to achieve targeted attacks.
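For context, the approach being discussed is a gradient-based targeted attack that descends a weighted sum of two losses. The sketch below is a minimal, hypothetical illustration of that loop: a toy linear-softmax model with two heads stands in for the ASR's CTC and seq2seq branches (the real attack would backpropagate CTC and NLL losses through a SpeechBrain model), and the perturbation is projected onto an L-infinity ball as in PGD. All names and constants here are illustrative, not from robust_speech.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_ce_wrt_input(W, x, target):
    # Gradient of cross-entropy toward `target` w.r.t. the input of a
    # linear-softmax model: d/dx CE(softmax(Wx), target) = W^T (p - onehot).
    p = softmax(W @ x)
    p[target] -= 1.0
    return W.T @ p

def targeted_attack(W_ctc, W_s2s, x, target, eps=3.0, lr=0.2,
                    alpha=0.5, steps=300):
    # PGD-style loop: descend a weighted sum of two losses (standing in for
    # the CTC + seq2seq NLL combination), projecting the perturbation onto
    # an L-infinity ball of radius eps after every step.
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = alpha * grad_ce_wrt_input(W_ctc, x + delta, target) \
            + (1.0 - alpha) * grad_ce_wrt_input(W_s2s, x + delta, target)
        delta -= lr * g
        delta = np.clip(delta, -eps, eps)
    return x + delta

# Toy demo: two random "heads" over an 8-dim input, 4 output classes.
rng = np.random.default_rng(0)
W_ctc, W_s2s = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
x = rng.normal(size=8)
x_adv = targeted_attack(W_ctc, W_s2s, x, target=2)
print(np.argmax(W_ctc @ x_adv), np.argmax(W_s2s @ x_adv))
```

On a real seq2seq ASR the target is a token sequence rather than a single class, so the seq2seq term is summed over decoding steps with teacher forcing on the target transcription; that extra structure is one reason these models are harder to attack than an encoder-only CTC model.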

RaphaelOlivier commented 1 year ago

Indeed, seq2seq models are harder to attack (in a targeted way) than encoder-only models. That matches what we found, although I don't think it comes down only to character- vs word-level modeling: the neural architecture matters as well. Note that we developed robust_speech to attack models built with SpeechBrain, not just the models released by the SpeechBrain team. If you try attacking some of the other models (wav2vec 2.0, for instance) you should observe different behavior. You can also train DeepSpeech with SpeechBrain and then reproduce your DeepSpeech results there.

LetterLiGo commented 1 year ago

Thanks so much for your suggestions. I'll try them.