DigitalPhonetics / speaker-anonymization

Speaker anonymization pipeline that hides a speaker's identity by changing the voice in a recording.
GNU General Public License v3.0

Generated audio not clear #6

Open · wennyramadha opened this issue 1 month ago

wennyramadha commented 1 month ago

Hello, may I ask for your guidance in generating the anonymized audio? I can run your code with the default settings, but the output audio is not clear.

Here is an example generated with a sampling rate of 16 kHz: https://drive.google.com/file/d/17bv8ZMYrOmohT8T61G3jg16udOoWiO05/view?usp=drive_link

And here is the audio with a sampling rate of 48 kHz: https://drive.google.com/file/d/1yQ56s5QGJuFDItTFO_mJ3hPKnyvFegHS/view?usp=drive_link

SarinaMeyer commented 1 month ago

Hi, I don't have permission to access the audio files; could you please change that? Also, could you either add the original audio file as well or give its name (if it is from a common dataset like LibriSpeech or VCTK)?

wennyramadha commented 1 month ago

Hi, I have updated the access permissions. All the example data is from the LibriSpeech test set. Below is the original audio: https://drive.google.com/file/d/1vvyzBbN-sK2_m4moVhJvSMe4oUeyGcK8/view?usp=drive_link

Thank you so much

wennyramadha commented 1 month ago

I use the prosody_cloning source code

SarinaMeyer commented 1 month ago

Thanks, I can access them now.

This definitely sounds bad, worse than in my experiments. Could you share the recognized transcript of this utterance? Also, do all audios sound like this? It might be a problem with this particular speaker embedding; it might sound better if you run the anonymization again with a new speaker selection (a new speaker selection is performed if you delete the old result files in the speaker_embeddings folder).
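
For example, something along these lines (a minimal sketch; the results folder path is an assumption, adjust it to wherever your run stores the embeddings):

```python
# Sketch: delete previously stored speaker embedding results so that the
# next anonymization run performs a fresh speaker selection.
# "results/speaker_embeddings" is an assumed path, adjust it to your setup.
from pathlib import Path

emb_dir = Path("results/speaker_embeddings")
if emb_dir.exists():
    for f in emb_dir.iterdir():
        if f.is_file():
            print(f"Deleting {f}")
            f.unlink()
```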

wennyramadha commented 1 month ago

The transcription result for this audio (121-123852-0002.wav) is in phonetic format, right? Below is a snapshot example covering all the audios (I only used about 58 audios as a sample, all LibriSpeech test data):

[screenshot of the phonetic transcripts]

Also, when I ran inference using run_inference.py, I got the following warning:

[screenshot of the warning message]

Thank you for your response. I will try your suggestion.

SarinaMeyer commented 1 month ago

Yes, the transcription is in phonetic format and seems to be correct, so the problem is not on the ASR's end. If you only used 58 samples, are they all from the same speaker? You will get the same output voice for the same input speaker, so try testing with a more diverse (speaker-wise) subset to check whether you see the same effect for different voices.
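
As a rough sketch (the data directory here is an assumption; LibriSpeech file names encode the speaker ID as the first field, e.g. 121-123852-0002):

```python
# Sketch: group LibriSpeech utterances by speaker ID (the first field of
# the file name) and keep a few utterances per speaker, so the subset
# covers several different input voices.
from collections import defaultdict
from pathlib import Path

data_dir = Path("data/libri_test")  # assumed location of the wav files
by_speaker = defaultdict(list)
for wav in sorted(data_dir.glob("*.wav")):
    by_speaker[wav.stem.split("-")[0]].append(wav)

# keep up to 3 utterances per speaker
subset = [wav for utts in by_speaker.values() for wav in utts[:3]]
print(f"{len(by_speaker)} speakers, {len(subset)} utterances in the subset")
```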

The warning should not matter; you can ignore it.

wennyramadha commented 1 month ago

The 58 samples are from two different speakers. Thank you for the suggestion; I will try it.

wennyramadha commented 1 month ago

Hi, I want to give an update on this issue. I am still experiencing the same thing: the output speech sounds the same even though I used all the data.

I also ran into the "pretrained_models" problem described in https://github.com/DigitalPhonetics/speaker-anonymization/issues/2, so I changed the code like this:

[screenshot of the modified code]

I use this model because later, at line 234, it is the only model that has 'style_emb_func':

[screenshot of line 234]

Could this be the cause of the problem?

SarinaMeyer commented 1 month ago

It is weird that the script even attempted to find the model in pretrained_models. In GANAnonymizer, the variable self.embed_model_path (which is then passed as model_path to the speaker embedding extraction) is overwritten with the path from the settings file, the one that you now set manually. The only idea I have is that something went wrong in the load_parameter function. Could you check whether this settings.json is loaded correctly?
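
A quick way to check (the path is an assumption, point it at the settings.json your run actually uses):

```python
# Sketch: load the settings file directly and print it, to verify that the
# entry which should overwrite self.embed_model_path is actually there and
# points to the model you set manually.
import json
from pathlib import Path

settings_path = Path("models/anonymization/settings.json")  # assumed path
settings = json.loads(settings_path.read_text())
print(json.dumps(settings, indent=2))
```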

SarinaMeyer commented 1 month ago

I have to admit, though, that this code is rather old, and I may have fixed some bugs in other versions of the code that I forgot to fix here. I would appreciate your help in trying to figure out your issue, but I understand that this might be too time-consuming for you.

You can find a working version of this model in the latest Voice Privacy Challenge. We included this model as baseline B3, in the code under the tag sttts. Compared to the default setting we have here, the challenge model includes prosody modifications by default, but you can disable them by commenting out the prosody anonymization part in the config. Alternatively, you can use the code in our VoicePAT toolkit, which was the basis on which the challenge code was restructured. The main branch underwent some changes during the challenge development, but you can find a working version in the develop branch (which will be moved to the main branch soon).

In any case, I recommend using either the Voice Privacy Challenge 2024 or VoicePAT for evaluation. They contain several improvements over the evaluation scripts of the Voice Privacy Challenge 2022 and 2020, which are still included in this repository.