Is it suitable for other languages?

Hello, thanks for your interest. I have several answers to give here.

Let me start by promoting the Hugging Face Space, which lets you test the models that I share without installing anything: https://huggingface.co/spaces/Champion/SA-toolkit

No audio files are saved by me. Execution happens in the Hugging Face cloud, and I don't know whether they store the audio data (very unlikely IMO). You can also use Docker:

docker run -it -p 7860:7860 --platform=linux/amd64 registry.hf.space/champion-sa-toolkit:latest python app.py

Both of these are CPU-only options. With them, you can easily check the generated audio yourself.

Then, about natural-sounding voice: there are some tricks that I did not implement in the provided models that would make them generate more natural-sounding speech (I used them for my thesis, but this toolkit is a re-implementation). For the sake of documentation, here are most of them:

- F0 stat speaker norm;
- F0 quant;
- ASR-bn extraction on subsamples during hifigan training to increase the amount of data (cache_functions = ["none"] or get_f0 only in the hifigan conf).

So the output speech is not the most natural sounding; I will let you be the judge.
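For readers unfamiliar with the first two tricks, here is a minimal sketch of what per-speaker F0 normalization and F0 quantization usually mean. This is not the toolkit's or the thesis' code: the function names, the use of log-F0, the 16 bins, and the [-3, 3] clipping range are all assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev

def normalize_f0(f0_hz, eps=1e-8):
    """Z-normalize log-F0 with the speaker's own mean/std (hypothetical helper).

    Unvoiced frames (F0 <= 0) are returned as None so they stay distinguishable.
    """
    voiced = [math.log(f) for f in f0_hz if f > 0]
    if not voiced:
        return [None] * len(f0_hz)
    mu = mean(voiced)
    sigma = pstdev(voiced)
    return [(math.log(f) - mu) / (sigma + eps) if f > 0 else None for f in f0_hz]

def quantize_f0(norm_f0, n_bins=16, lo=-3.0, hi=3.0):
    """Map normalized F0 to integer bins 1..n_bins; unvoiced frames map to bin 0.

    n_bins, lo and hi are illustrative values, not the toolkit's settings.
    """
    width = (hi - lo) / n_bins
    bins = []
    for value in norm_f0:
        if value is None:
            bins.append(0)
            continue
        clipped = min(max(value, lo), hi)
        index = int((clipped - lo) / width)
        bins.append(min(index, n_bins - 1) + 1)
    return bins

# Usage: two voiced frames surrounded by unvoiced frames.
f0 = [0.0, 100.0, 200.0, 0.0]
f0_bins = quantize_f0(normalize_f0(f0))
```

Normalizing per speaker removes the speaker's average pitch and range, and quantizing coarsens what is left of the pitch contour.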

As the model is trained on English, it performs best in that language. It can work with other languages too, but the linguistic content (what someone says) will be degraded, especially if the source audio quality is already bad.

About the models: the stronger they are at anonymizing the speaker, the more specific they are to English. German is not that far from English, so it could work.

In a previous version of the toolkit, we trained an anonymization model for French using MLS (http://openslr.org/94/); see some old, non-working code here: https://github.com/deep-privacy/SA-toolkit/tree/master/egs/asr/mls MLS has a German section that could be adapted to the toolkit (a substantial amount of work).

Given the list here: https://huggingface.co/spaces/Champion/SA-toolkit, here are some comments:

'hifigan_bn_tdnnf_wav2vec2_vq_48_v1': The best for privacy (harder to invert anonymization), good for clean speech (close mic) and English.
'hifigan_bn_tdnnf_wav2vec2_100h_aug_v1': The best for natural speech generation, but with little privacy guarantee (easy to invert anonymization).
The others are more for research purposes, but 'hifigan_bn_tdnnf_600h_aug_v1' can be interesting (similar to 'hifigan_bn_tdnnf_wav2vec2_100h_aug_v1').
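As a rough illustration of the privacy versus naturalness trade-off in the list above, the choice can be written as a small lookup. The model names come from the list; the `MODEL_NOTES` dict and the `pick_model` helper are hypothetical and are not part of the toolkit.

```python
# Notes paraphrased from the list above; this dict is illustrative only.
MODEL_NOTES = {
    "hifigan_bn_tdnnf_wav2vec2_vq_48_v1": "best privacy; clean, close-mic English speech",
    "hifigan_bn_tdnnf_wav2vec2_100h_aug_v1": "most natural speech; weak privacy",
    "hifigan_bn_tdnnf_600h_aug_v1": "similar to the wav2vec2 100h_aug model",
}

def pick_model(priority):
    """Return a model name for a priority of "privacy" or "naturalness".

    Hypothetical helper: the toolkit does not expose such a function.
    """
    if priority == "privacy":
        return "hifigan_bn_tdnnf_wav2vec2_vq_48_v1"
    if priority == "naturalness":
        return "hifigan_bn_tdnnf_wav2vec2_100h_aug_v1"
    raise ValueError(f"unknown priority: {priority!r}")

# Usage: pick a model, then look up the note attached to it.
chosen = pick_model("privacy")
note = MODEL_NOTES[chosen]
```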

Depending on the threat level that you are considering, a weak anonymization could be enough. Otherwise, if your threat level is very high, 'hifigan_bn_tdnnf_wav2vec2_vq_48_v1' is the best model that I can provide. (A number that does not mean anything: a BIG estimate is roughly 70% anonymization, with 100% being perfect anonymization, which is not achievable with this toolkit, nor with other toolkits without a significant loss of utility!) Research is still active in the domain. If you are interested in the domain and its (real) metrics, check out my thesis. Best.
