Open wennyramadha opened 1 month ago
Hi, I don't have permission to access the audio files, could you please update the sharing settings? Also, could you either add the original audio file as well or give its name (if it is from a common dataset like LibriSpeech or VCTK)?
Hi, I have updated the permission access. All the example data is from librispeech_test. Below is the original audio https://drive.google.com/file/d/1vvyzBbN-sK2_m4moVhJvSMe4oUeyGcK8/view?usp=drive_link
Thank you so much
I use the prosody_cloning source code
Thanks, I can access them now.
This definitely sounds bad, worse than in my experiments. Could you share the recognized transcript from this utterance? Also, are all audios like this? It might be that this is a problem of this particular speaker embedding, maybe it would sound better if you run the anonymization again with a new speaker selection (a new speaker selection should be performed if you delete the old result files in the speaker_embeddings folder).
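For reference, triggering a fresh speaker selection just means removing the cached result files before rerunning. A minimal sketch, assuming the cached files live in a `speaker_embeddings` directory under your results folder (the exact path depends on your config):

```python
from pathlib import Path

def clear_speaker_embeddings(results_dir):
    """Delete cached speaker-embedding result files so the next
    anonymization run performs a new speaker selection.
    Returns the names of the removed files."""
    removed = []
    for f in Path(results_dir, "speaker_embeddings").glob("**/*"):
        if f.is_file():
            f.unlink()
            removed.append(f.name)
    return removed
```

After clearing the folder, rerun the anonymization as before; the pipeline should then recompute the embeddings and pick new target speakers.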
The transcription result for this audio (121-123852-0002.wav) is in phonetic format, right? Below is a snapshot example from all the audio files (I only use about 58 audio files as samples, all from the LibriSpeech test data).
Actually, when I run the script for inference using run_inference.py, I get the following warning.
Thank you for your response. I will try your suggestion.
Yes, the transcription is in phonetic format and seems to be correct, so the problem is not at the ASR's end. If you only used 58 samples, are they all from the same speaker? You will get the same output voice for the same input speaker, so maybe test this with a more diverse (speaker-wise) subset to see whether you observe the same effect for different voices.
The warning should not matter, you can ignore it.
The 58 samples are from 2 different speakers. Thank you for your suggestion, I will try it.
Hi, I want to give an update on this issue. I am still experiencing the same thing: the output speech sounds the same even though I used all the data.
Actually, I also ran into the "pretrained_models" problem described in this issue https://github.com/DigitalPhonetics/speaker-anonymization/issues/2 and then changed it like this:
I use this model because, later on in line 234, only this model has 'style_emb_func'.
Could this be the cause of the problem?
It is weird that the script even attempted to find the model in pretrained_models. In GANAnonymizer, the variable self.embed_model_path (which is then passed as the variable model_path in the speaker embedding extraction) is overwritten with the path from the settings file, the one that you have now set manually. The only idea I have is that something went wrong in this load_parameter function. Could you check whether this settings.json is loaded correctly?
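One quick way to rule out a loading problem is to print what actually arrives from the settings file. A small sketch, assuming a JSON settings file; the key names checked here are illustrative, substitute the ones your settings.json actually uses:

```python
import json

def inspect_settings(settings_file):
    """Load a settings.json and print the entries relevant for the
    speaker embedding extraction (key names are illustrative)."""
    with open(settings_file) as f:
        settings = json.load(f)
    for key in ("embed_model_path", "vec_type"):
        print(f"{key}: {settings.get(key, '<missing>')}")
    return settings
```

If the printed path is `<missing>` or still points at pretrained_models, the settings file is not being picked up where you expect.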
I have to admit though that this code is rather old and I might have fixed some bugs in other versions of the code that I might have forgotten to fix here too. I would appreciate your help in trying to figure out your issue but I understand that this might be too time-consuming for you.
You can find a working version of this model in the latest Voice Privacy Challenge. We included this model as baseline B3, in the code under the tag sttts. Compared to the default setting we have here, the model in the challenge includes prosody modifications by default, but you can disable them by commenting out the prosody anonymization part in the config. Alternatively, you can use the code in our VoicePAT toolkit, which was the basis on which the challenge code was restructured. The main branch underwent some changes during the challenge development, but you can find a working version in the develop branch (which will be moved to the main branch soon).
In any case, I recommend using either the Voice Privacy Challenge 2024 or VoicePAT for evaluation. They contain several improvements over the evaluation scripts of the Voice Privacy Challenge 2022 and 2020, which are still included in this repository.
Hello, may I ask for your guidance in generating the anonymized audio? I can run your code with the default setting but the output audio is not clear.
Here is an example generated with a sampling rate of 16 kHz: https://drive.google.com/file/d/17bv8ZMYrOmohT8T61G3jg16udOoWiO05/view?usp=drive_link
And here is the audio with a sampling rate of 48 kHz: https://drive.google.com/file/d/1yQ56s5QGJuFDItTFO_mJ3hPKnyvFegHS/view?usp=drive_link