CSTR-Edinburgh / merlin

This is now the official location of the Merlin project.
http://www.cstr.ed.ac.uk/projects/merlin/
Apache License 2.0

Training acoustic model with 48000 Hz? #247

Closed dreamk73 closed 7 years ago

dreamk73 commented 7 years ago

I am trying to train an acoustic model with 48000Hz acoustic data using the WORLD vocoder. I have changed the parameters in the conf file to reflect this change.

[Outputs]
# dX should be 3 times X
mgc: 60
dmgc: 180
bap: 5 <- was 1 at 16000 Hz
dbap: 15 <- was 3 at 16000 Hz
lf0: 1
dlf0: 3

[Waveform]
test_synth_dir: None
# options: WORLD or STRAIGHT
vocoder_type: WORLD
samplerate: 48000
framelength: 2048
# Frequency warping coefficient used to compress the spectral envelope into MGC (or MCEP)
fw_alpha: 0.77
minimum_phase_order: 1023

I extracted the parameters using merlin/misc/scripts/vocoder/world/extract_feature_for_merlin.py, giving it the correct sample rate. Training completes without errors, but the output audio is not speech: it is a steadily increasing buzzing sound.

Any idea what I am doing wrong here?
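The "dX should be 3 times X" rule from the [Outputs] section above can be checked mechanically before training. The following is a minimal sketch, assuming the conf file can be read with Python's standard configparser; the helper name check_delta_dims is hypothetical and is not part of Merlin.

```python
import configparser

# The [Outputs] values posted in this issue.
CONF_TEXT = """
[Outputs]
mgc: 60
dmgc: 180
bap: 5
dbap: 15
lf0: 1
dlf0: 3
"""


def check_delta_dims(conf_text, streams=("mgc", "bap", "lf0")):
    """Return the streams whose dX entry is not 3 times the X entry."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    outputs = parser["Outputs"]
    bad = []
    for name in streams:
        static_dim = outputs.getint(name)
        delta_dim = outputs.getint("d" + name)
        if delta_dim != 3 * static_dim:
            bad.append(name)
    return bad


inconsistent = check_delta_dims(CONF_TEXT)
print(inconsistent)
```

An empty list means the static and delta dimensions agree, which is the case for the values posted here, so the dimensions themselves are not the source of the problem.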

ronanki commented 7 years ago
dreamk73 commented 7 years ago

I ran the copy synthesis script and it sounded great. The latest build_your_own_voice/s1/conf does not have a global_settings.cfg file. But slt_arctic does so I used that and created a new conf file. I double-checked that I ran the extraction script with 48000 Hz. The result is the same as before:

2017-09-22 10:21:47,186 INFO main.train_DNN: epoch 16, validation error 199.470139, train error 190.619293 time spent 24.34
2017-09-22 10:21:47,186 DEBUG main.train_DNN: stopping early
2017-09-22 10:21:47,186 INFO main.train_DNN: overall training time: 6.71m validation error 197.911133
2017-09-22 10:26:40,168 INFO main : Develop: DNN -- MCD: 7.318 dB; BAP: 0.333 dB; F0:- RMSE: 42.471 Hz; CORR: 0.080; VUV: 27.218%
2017-09-22 09:18:34,915 INFO main : Test : DNN -- MCD: 7.198 dB; BAP: 0.336 dB; F0:- RMSE: 40.351 Hz; CORR: 0.020; VUV: 26.419%

I have lowered the learning rate to 0.001 but that doesn't improve anything. For comparison, these are the results for the 16kHz versions of these wavefiles:

2017-09-15 12:50:19,687 INFO main.train_DNN: epoch 25, validation error 160.428696, train error 153.798431 time spent 20.26
2017-09-15 12:50:24,046 INFO main.train_DNN: overall training time: 10.39m validation error 160.428696
2017-09-15 12:50:24,052 INFO main: generating from DNN
2017-09-15 12:50:35,498 DEBUG main: denormalising generated output using method MVN
2017-09-15 12:50:36,674 INFO main: calculating MCD
2017-09-15 12:50:37,320 INFO main: Develop: DNN -- MCD: 4.711 dB; BAP: 0.172 dB; F0:- RMSE: 20.303 Hz; CORR: 0.754; VUV: 4.517%
2017-09-15 12:50:37,320 INFO main: Test : DNN -- MCD: 4.709 dB; BAP: 0.172 dB; F0:- RMSE: 19.609 Hz; CORR: 0.748; VUV: 4.508%

dreamk73 commented 7 years ago

I think it may have to do with my waveform conversion using sox. The audio files are headerless, so I used sox to convert them to give them a header and make them so that World can process them. I used:

sox -t raw -r 48000 -c 1 -b 32 -e floating-point --norm=-3 in.wav -r 48000 -c 1 -b 16 -e signed-integer out.wav
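For anyone who wants to see what that conversion does, here is a rough Python sketch of the same steps: read headerless mono 32-bit float samples, scale the peak to -3 dBFS, and write a 16-bit signed-integer WAV with a header. It is not a replacement for sox and makes assumptions the command line does not state: little-endian input, samples nominally in [-1.0, 1.0], and a hypothetical function name raw_float32_to_wav16. The four test samples are made up.

```python
import array
import os
import struct
import sys
import tempfile
import wave


def raw_float32_to_wav16(raw_path, wav_path, sample_rate=48000, peak_db=-3.0):
    """Convert headerless mono 32-bit float audio to a 16-bit PCM WAV file.

    Assumes little-endian input with samples nominally in [-1.0, 1.0].
    Returns the number of samples written.
    """
    samples = array.array("f")
    with open(raw_path, "rb") as f:
        samples.frombytes(f.read())
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed little-endian

    # Scale so the loudest sample sits at peak_db dBFS (here -3 dB).
    peak = max((abs(s) for s in samples), default=0.0)
    gain = (10.0 ** (peak_db / 20.0)) / peak if peak > 0 else 1.0

    # Clamp to [-1, 1] and quantise to 16-bit signed integers.
    pcm = array.array(
        "h",
        (int(round(max(-1.0, min(1.0, s * gain)) * 32767)) for s in samples),
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian

    with wave.open(wav_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes = 16 bits
        w.setframerate(sample_rate)
        w.writeframes(pcm.tobytes())
    return len(pcm)


# Tiny round trip on synthetic data (four made-up samples).
with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "in.raw")
    wav_path = os.path.join(tmp, "out.wav")
    with open(raw_path, "wb") as f:
        f.write(struct.pack("<4f", 0.0, 0.5, -1.0, 0.25))
    n_written = raw_float32_to_wav16(raw_path, wav_path)
    with wave.open(wav_path, "rb") as w:
        params = (w.getnchannels(), w.getsampwidth(), w.getframerate(), w.getnframes())
        frames = struct.unpack("<4h", w.readframes(4))

print(n_written, params, frames)
```

Because the gain is derived from the measured peak, no sample can exceed full scale after scaling, so the clamp only matters if the gain is replaced by a fixed value.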

RasmusD commented 7 years ago

You could try using SPTK's rawtowav instead of sox?

I train 48 kHz voices using Merlin all the time and they work perfectly fine.

dreamk73 commented 7 years ago

Thanks @RasmusD. rawtowav does not work for some reason. The output wav looks and sounds bad. I have included a waveform that I use as input. Remove the .txt ending, which I only added to be able to upload it here.

When I run soxi on it I get:

soxi WARN wav: wave header missing extended part of fmt chunk

Input File     : 'sent001.wav'
Channels       : 1
Sample Rate    : 48000
Precision      : 25-bit
Duration       : 00:00:09.39 = 450560 samples ~ 704 CDDA sectors
File Size      : 1.80M
Bit Rate       : 1.54M
Sample Encoding: 32-bit Floating Point PCM

I have tried using ch_wave, sox, rawtowav. I have tried to make it 16-bit signed-integer PCM. Nothing seems to work to get Merlin to train models with them.

sent001.wav.txt

simonkingedinburgh commented 7 years ago

Esther Judd-Klabbers wrote:

Precision : 25-bit

That doesn't look right

dreamk73 commented 7 years ago

I have no idea how it thinks the precision is 25-bit; I agree with you that it looks strange. My boss converted the files for me to another format, and as you can see from the soxi output below, this looks better. But I still don't get any usable Merlin output.

Input File     : '/data/esther/rs_merlin_data/sophie/wav48_kare/sn001_sent001.wav'
Channels       : 1
Sample Rate    : 48000
Precision      : 16-bit
Duration       : 00:00:09.39 = 450560 samples ~ 704 CDDA sectors
File Size      : 901k
Bit Rate       : 768k
Sample Encoding: 16-bit Signed Integer PCM

sent001_2.wav.txt
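The same header check can be scripted over a whole directory instead of running soxi by hand. This is a minimal sketch using only Python's standard wave module; the helper names describe_wav and is_expected_format are hypothetical. As far as I know, the wave module only parses PCM-encoded files and raises wave.Error on other encodings such as 32-bit float, so the sketch treats that error as "not the expected format".

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (channels, sample_rate, bits_per_sample, n_frames) of a PCM WAV."""
    with wave.open(path, "rb") as w:
        return w.getnchannels(), w.getframerate(), 8 * w.getsampwidth(), w.getnframes()


def is_expected_format(path, sample_rate=48000, bits=16, channels=1):
    """True if the file is readable PCM with the given rate, width and channels."""
    try:
        ch, sr, b, _ = describe_wav(path)
    except (wave.Error, EOFError):
        # Header missing, truncated, or an encoding the wave module cannot parse.
        return False
    return (ch, sr, b) == (channels, sample_rate, bits)


# Demo on a synthetic file: 480 frames of silence at 48 kHz, 16-bit mono.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(48000)
        w.writeframes(b"\x00\x00" * 480)
    info = describe_wav(demo_path)
    ok_48k = is_expected_format(demo_path)
    ok_16k = is_expected_format(demo_path, sample_rate=16000)

print(info, ok_48k, ok_16k)
```

A check like this only confirms the container format. It says nothing about whether the audio content lines up with the labels.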

dreamk73 commented 7 years ago

Some files give me this error when extracting acoustic features using the WORLD vocoder:

x2x : error: input data is over the range of type 'float'!

RasmusD commented 7 years ago

Did you use rawtowav or raw2wav?

raw2wav has a bug in it that makes it output bad wav files. Could that be it?

dreamk73 commented 7 years ago

My SPTK version only has rawtowav.

But the problem has been solved now. It wasn't in the conversion at all, but a stupid mistake on our end. For this particular speaker, we had recorded with a long silence at the start to overcome difficulties with the connection, and when we created the 22kHz versions and label files, we cropped that silence. The 48kHz files were never cropped, so when training models in Merlin with them, a lot of the labels were actually aligned to silence.

I now use sox to convert the unheadered cropped 48kHz files using:

sox -V4 -v 0.8 -t raw -r 48000 -c 1 -b 32 -e floating-point in.wav -r 48000 -c 1 -b 16 -e signed-integer out.wav

Sox still complains about clipping in a small number of samples (1-3 per file), but I can train Merlin now with this data.
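A mismatch like the one described above, where labels come from cropped audio and waveforms are uncropped, can be caught early by comparing the end time of each label file with the duration of its waveform. Below is a minimal sketch, assuming HTK-style label lines of the form "start end name" with times in units of 100 ns. The function names, the 0.05 s tolerance, and the example label lines are all invented for illustration; only the sample count (450560 at 48000 Hz) comes from the soxi output earlier in the thread.

```python
def label_end_seconds(label_lines):
    """End time in seconds of the last labelled segment (HTK times are 100 ns units)."""
    end = 0
    for line in label_lines:
        fields = line.split()
        if len(fields) >= 3:
            end = max(end, int(fields[1]))
    return end / 10_000_000


def duration_mismatch(label_lines, n_samples, sample_rate, tolerance=0.05):
    """Return (mismatch, audio_seconds, label_seconds).

    mismatch is True when audio and label durations differ by more than
    `tolerance` seconds (an arbitrary threshold chosen for this sketch).
    """
    audio_sec = n_samples / sample_rate
    label_sec = label_end_seconds(label_lines)
    return abs(audio_sec - label_sec) > tolerance, audio_sec, label_sec


# Hypothetical labels covering 6.0 s, checked against a 450560-sample 48 kHz file.
example_labels = [
    "0 5000000 sil",
    "5000000 60000000 speech",
]
mismatch, audio_sec, label_sec = duration_mismatch(example_labels, 450560, 48000)
print(mismatch, round(audio_sec, 2), label_sec)
```

Running such a check over every utterance before feature extraction would flag files whose audio is several seconds longer than their labels.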