andabi / deep-voice-conversion

Deep neural networks for voice conversion (voice style transfer) in Tensorflow
MIT License

Tips for generating good results. #62

Open carlfm01 opened 5 years ago

carlfm01 commented 5 years ago

@andabi can you tell me exactly what you did with the tools? Is normalization required for the training and target audio? I used a script to check the min and max dB, but I'm still getting noise that makes the output sound like a movie robot. How do you choose the values for n_mels and n_mfcc? For Spanish I'm using a default Windows TTS voice, and net 1 reaches 95% accuracy.

Loss for net2 (image: net2)

Does it need more training?

results.zip
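For reference, the min/max dB check mentioned in the question usually feeds a min-max normalization like the one below. This is a generic sketch, not necessarily what the repository does; `normalize_db`, `denormalize_db`, and the `min_db`/`max_db` bounds are placeholder assumptions, and the bounds would need to be measured on the actual audio.

```python
def normalize_db(values_db, min_db, max_db):
    """Scale dB values to [0, 1], clipping anything outside [min_db, max_db]."""
    span = float(max_db - min_db)
    return [min(1.0, max(0.0, (v - min_db) / span)) for v in values_db]


def denormalize_db(values, min_db, max_db):
    """Inverse of normalize_db for values that were not clipped."""
    return [v * (max_db - min_db) + min_db for v in values]


# Placeholder bounds; measure them on your own training and target audio.
min_db, max_db = -80.0, 20.0
normalized = normalize_db([-100.0, -80.0, -30.0, 20.0, 35.0], min_db, max_db)
print(normalized)
```

If the bounds are too narrow for the target audio, a large share of frames gets clipped to 0 or 1, which is one plausible (but unconfirmed) source of audible artifacts.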

sailor88128 commented 5 years ago

Hi @carlfm01, I just listened to your Spanish sample. Was it trained on a dataset with phoneme labels?

carlfm01 commented 5 years ago

Hi @sailor88128, I built it on my own in C# using an existing Windows TTS. The Windows speech API has an event that is raised when the TTS reaches a phoneme, so I wrote code that writes the phn files using the Windows API. In any case, if you want to label the data or generate it from an existing TTS, remember to multiply the current phoneme time by the sampling rate (16000 in my case). Remember that this approach will only work with the voice of the TTS used for the dataset. I can share the code and the dataset if you want.
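A minimal sketch of the time-to-sample conversion described above, assuming TIMIT-style .phn lines of the form `start_sample end_sample phoneme`, phoneme times given in seconds, and a 16 kHz sampling rate. The function name `phoneme_times_to_phn_lines` and the example timings are hypothetical and are not taken from the C# tool.

```python
def phoneme_times_to_phn_lines(segments, sample_rate=16000):
    """Convert (start_sec, end_sec, phoneme) tuples to TIMIT-style .phn lines.

    Each output line is "<start_sample> <end_sample> <phoneme>".
    """
    lines = []
    for start_sec, end_sec, phoneme in segments:
        # Multiply times in seconds by the sampling rate to get sample offsets.
        start_sample = int(round(start_sec * sample_rate))
        end_sample = int(round(end_sec * sample_rate))
        lines.append("{} {} {}".format(start_sample, end_sample, phoneme))
    return lines


# Hypothetical timings, as a TTS phoneme event might report them (in seconds).
segments = [(0.0, 0.12, "h#"), (0.12, 0.25, "o"), (0.25, 0.40, "l"), (0.40, 0.62, "a")]
phn_lines = phoneme_times_to_phn_lines(segments)
print("\n".join(phn_lines))
```

If the TTS reports times in milliseconds instead, divide by 1000 before multiplying by the sampling rate.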

sailor88128 commented 5 years ago

Sounds interesting! @carlfm01, I would like to learn and try your method, please share it with me. Email: weiqingkai1@163.com. Thanks a lot~

carlfm01 commented 5 years ago

@sailor88128 Sent, let me know if you have any questions. I'll make a repo later.

sailor88128 commented 5 years ago

Thanks, I'll try. @carlfm01

sailor88128 commented 5 years ago

I have a question about the generated data, @carlfm01. TIMIT was used for Net1 training. It contains utterances from 630 speakers, who read similar sentences, together with the corresponding phone labels. The TTS tool only offers the 'Helena' voice, or fewer than ten voices at most. Is that enough to train the phoneme classifier?

carlfm01 commented 5 years ago

Hi @sailor88128, if you want many-to-one conversion then the TTS is not an option; many-to-one requires a TIMIT-like dataset, or even more data depending on the language. The TTS approach only lets you convert that one TTS voice into the target voice style. I was really curious to see whether style transfer could turn the horrible sound of the TTS into a more natural voice, and to be honest the result is interesting. Now I'm trying to minimize the robotic sound, but in terms of the features of the language it is really amazing.

jefhai commented 5 years ago

@carlfm01 I agree, the results sound like a person or robot talking into a fan. https://www.youtube.com/watch?v=vNBajseh6sE

I'd love to figure out how to de-noise these results into clearer audio.

carlfm01 commented 5 years ago

After a hard time with Linux and CUDA versions, here's another result. I increased the win length and hop, since in Spanish the vowels seem to be a little longer, and also changed the learning rate from lr = 0.0003 to lr = 0.0001.

mresults.zip
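To make the window and hop change concrete, here is a small sketch of how the two settings map to samples and frame counts at 16 kHz. The millisecond values are illustrative assumptions, not the settings used for the attached result, and the helper names are hypothetical.

```python
def ms_to_samples(milliseconds, sample_rate=16000):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(sample_rate * milliseconds / 1000.0))


def num_frames(num_samples, hop_length):
    """Frame count for a centered STFT (the librosa default): 1 + n // hop."""
    return 1 + num_samples // hop_length


sample_rate = 16000
two_seconds = 2 * sample_rate

# Illustrative settings: a short window/hop and a longer one.
for win_ms, hop_ms in [(25, 5), (50, 12.5)]:
    win_length = ms_to_samples(win_ms, sample_rate)
    hop_length = ms_to_samples(hop_ms, sample_rate)
    print(win_length, hop_length, num_frames(two_seconds, hop_length))
```

A longer hop produces fewer frames per second, so each frame covers a larger slice of a sustained vowel, at the cost of coarser time resolution.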

sailor88128 commented 5 years ago

Sounds much better, great! @carlfm01

carlfm01 commented 5 years ago

Thanks @sailor88128. I still haven't had time to create your dataset. I got access to a P100 with Linux on Azure, but it is much slower than Windows with a K80 on the same version, and I don't know where the bottleneck is. A considerable number of operations run on the CPU. I'm trying to place operations manually on the GPU; if it crashes, I'll look for alternatives or newer TensorFlow operations that are supported on the GPU.

@sailor88128 can you add log_device_placement=True to tf.ConfigProto in train1 to see where the ops are placed? For example:

session_conf = tf.ConfigProto(
    log_device_placement=True,  # log which device each op is placed on
    allow_soft_placement=False,  # raise an error instead of moving an op to another device
    gpu_options=tf.GPUOptions(allow_growth=False),
)

Please also tell me your Python and TensorFlow versions.

For Python 3.6 and TF 1.11.0: op_alloc.txt
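One way to summarize such a placement log is to count ops per device. The sketch below assumes placement lines of the form `name: (OpType): /job:localhost/replica:0/task:0/device:GPU:0`, which is how TF 1.x typically prints them; the sample lines are made up and are not taken from op_alloc.txt, and `count_ops_per_device` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches e.g. "MatMul_1: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0"
PLACEMENT_RE = re.compile(r"(?P<name>\S+): \((?P<op>\w+)\): \S*device:(?P<device>\w+:\d+)")


def count_ops_per_device(log_lines):
    """Return a Counter mapping a device (e.g. 'GPU:0') to the number of placed ops."""
    counts = Counter()
    for line in log_lines:
        match = PLACEMENT_RE.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts


# Made-up sample lines in the assumed format.
sample_log = [
    "net/dense/MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0",
    "net/dense/BiasAdd: (BiasAdd): /job:localhost/replica:0/task:0/device:GPU:0",
    "input/IteratorGetNext: (IteratorGetNext): /job:localhost/replica:0/task:0/device:CPU:0",
    "unrelated log line",
]
counts = count_ops_per_device(sample_log)
print(dict(counts))
```

Running this over the logs from two setups makes it easier to compare how many ops end up on the CPU in each.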

sailor88128 commented 5 years ago

Hi @carlfm01, Python 2.7 and TF 1.8.0. The log file is attached below. Thanks for your help with the data. Maybe I can first try to generate Chinese data for Net1 myself; could you please provide a more detailed description of the process?

op_alloc-py2.txt

robd003 commented 5 years ago

@carlfm01 Did you ever figure out how to move more operations to the P100 GPU instead of the CPU on Linux?

carlfm01 commented 5 years ago

Hi @robd003, no. I tried but failed; the problem is that net 2 uses types that only the CPU can execute.

carlfm01 commented 5 years ago

Hi @sailor88128 this may interest you https://github.com/open-speech/speech-aligner

miaoYuanyuan commented 5 years ago

Do you have any Chinese corpus for phoneme classification?

Huishou commented 5 years ago

@carlfm01, thanks for your help. I changed my computer and got a result with your code on an i9 9900K and RTX 2080 Ti under Windows 10, but it took a very long time and it does not sound very good. I want to switch to my native language, Norwegian. What do I need to change? Is it enough to change the phns to Norwegian phonemes and to use Norwegian datasets for train1 and train2?

Besides, I think this code only does one-to-one conversion. I saw that the arctic data only covers one man and one woman; do we just convert between them? How should I think about train1 and train2?

Looking forward to your reply! Thank you very much!

jefhai commented 5 years ago

Just use AWS and skip buying an expensive computer if you're just fooling around with the code and not doing production work. It sounds like you had access to a top-of-the-line consumer PC. You could spin up some machine learning clusters and save the money you would spend on a pricier computer.

carlfm01 commented 5 years ago

Hi @Huishou, to use another language you need a dataset like TIMIT for net 1, and for net 2 a good amount of audio of your target voice. Then yes, change the phonemes used in your dataset or in net1. In my case, for Spanish, it was one-to-one: my goal was to convert the Microsoft Helena TTS voice into a more natural voice. As far as I know there is no TIMIT-like dataset for Spanish that would make many-to-one work. Yes, the results sound bad; I couldn't reproduce the author's results :/

Huishou commented 5 years ago

Hi @carlfm01, I still have two questions. First, I could not find data like TIMIT in my language for net1. What should I do to use this code for CV? Second, when I checked the results, I found that every output clip was two seconds long and some words seemed to be cut off. For example, the source speaker has 3 seconds of audio but the result only contains 2 seconds, which means the conversion is incomplete. How should I modify the code so that the speech is converted completely? Looking forward to your reply!

carlfm01 commented 5 years ago

Hi @Huishou, yes, it is hard to find a TIMIT-like dataset for other languages. I don't know what you mean by CV; Common Voice from Mozilla? To increase the length you have to change the value of duration here: https://github.com/carlfm01/deep-voice-conversion/blob/5de50b955c8dd37d0948c0cc1bc965fa515aed88/params.py#L25
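The truncation described in the question is what a fixed-length crop produces. A minimal sketch, assuming the data pipeline trims or pads each waveform to `sample_rate * duration` samples; `fix_length` here is a hypothetical stand-in written for illustration, not the repository's function.

```python
def fix_length(samples, sample_rate, duration):
    """Trim or zero-pad a list of samples to exactly sample_rate * duration samples."""
    target = int(sample_rate * duration)
    if len(samples) >= target:
        return samples[:target]  # anything past `duration` seconds is dropped
    return samples + [0.0] * (target - len(samples))


sample_rate = 16000
three_second_clip = [0.1] * (3 * sample_rate)

cropped = fix_length(three_second_clip, sample_rate, duration=2)
kept_whole = fix_length(three_second_clip, sample_rate, duration=3)
print(len(cropped) / sample_rate, len(kept_whole) / sample_rate)
```

With a duration of 2, the last second of a 3-second source is discarded before conversion, which would match the cut-off words; a larger duration keeps the whole clip at the cost of more memory per example.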

Huishou commented 5 years ago

Hi @carlfm01, I am very grateful for your help. I mistakenly wrote VC (voice conversion) as CV. Now, what should I do to use this code for conversion in other languages? One-to-one is fine! I remember that you converted the Microsoft Helena TTS voice into a more natural voice. I would really appreciate some help and advice, thank you!

carlfm01 commented 5 years ago

@Huishou yes, using the Helena TTS I created a dataset with the phonemes and speech aligned. If you want to do the same, use the Windows TTS for your language and check whether it reports the phonemes with their timings. Then use a target voice that you like to train net2. Take a look at LibriVox.

Huishou commented 5 years ago

@carlfm01, thanks for your reply, but I don't understand how to use the TTS system to create a dataset with the phonemes and speech aligned. Can you share the code with me? I am new to the speech synthesis field. Thanks a lot!

carlfm01 commented 5 years ago

Hi @Huishou, the code is in C#, is that OK? Where can I send it to you?

Huishou commented 5 years ago

Hi @carlfm01, thank you very much. You can send it to my email, 13936187167@163.com. Thank you!

Huishou commented 5 years ago

Hi @carlfm01, my email is 13936187167@163.com. I don't think there is a problem with my mailbox, but I haven't received your email yet. I don't know what is happening.

carlfm01 commented 5 years ago

Hi, just sent the email, sorry for the delay :p