albertaparicio / tfg-voice-conversion

Deep Learning-based Voice Conversion system
GNU General Public License v3.0

A question about "Zaska" and "dtw -b", how could I get more feature by running "compute_dtw.sh"? #4

Open HudsonHuang opened 7 years ago

HudsonHuang commented 7 years ago

I tried the solution provided by lf0_lstm.py and the related scripts. When I tried to modify the training parameters, the script /data/training/compute_dtw.sh confused me.

```bash
ZASKA="Zaska -P $PRM_NAME $PRM_OPT"

# Compute mfcc $DIR_REF/${FILENAME}.wav $DIR_TST/${FILENAME}.wav => mfcc/$DIR_REF/${FILENAME}.prm mfcc/$DIR_TST/${FILENAME}.prm
${ZASKA} -t RAW -x wav=msw -n . -p mfcc -F ${DIR_REF}/${FILENAME}_sil ${DIR_TST}/${FILENAME}_sil

# Align: mfcc/${DIR_REF}/${FILENAME}.prm, mfcc/${DIR_TST}/${FILENAME}.prm => dtw/${DIR_REF}-${DIR_TST}/${FILENAME}.dtw
b=2
dtw -b -${b} -t mfcc/${DIR_TST} -r mfcc/${DIR_REF} -a ${DIR_DTW}/beam${b} -w -B -f -F ${FILENAME}_sil
```

Running the script is difficult: the "Zaska" command does not exist in any package I could find, and the "dtw" command I have does not accept a "-b" parameter. How can I solve this?

By the way, I want to run this script because I would like to add more parameters to the training. I modified "tfglib" and tried to build /data/train_datatable.h5 again.

The result contains very few harmonic components. It may be necessary to use higher-order feature extraction and to adjust the network so that it can fit the additional higher-order features. (The default training also seems to overfit.) In addition, the converted audio sounds dull and low and lacks presence, which may be due to the missing higher-order harmonic features.

albertaparicio commented 7 years ago

Dear Hudson,

Sorry for the delay in my response, I have been busy working on the seq2seq model.

Regarding the Zaska and dtw scripts, they belong to the Signal Theory and Communications department at the UPC university in Catalonia (this project is being developed for my bachelor thesis).

I have contacted the people at the department to ask them if I can distribute these scripts. I'll get back to you as soon as I have a response.

Regarding the resulting sound of the system, I am aware that it does not give very accurate results. You see, the scripts you mention belong to the 'baseline' of the system. This version was developed only to have a reference level of result quality, as we have been focusing our efforts (and still are) on the sequence-to-sequence model.

If you find a way to improve this baseline, that is great news, but we are not going to work on it anymore.

As always, thank you for your interest in this project.

Cheers!

HudsonHuang commented 7 years ago

Dear Albert,

Thank you so much for your response. The seq2seq model is definitely a good idea.

As a reference, you can also check out this company: https://lyrebird.ai/. They are trying to offer an API-level voice conversion solution for commercial purposes, and they seem to have a good team, including Yoshua Bengio.

But as you can see, they still have not reached a much higher quality than the Mixture Neural Network solution in your project. Perhaps this marks the current ceiling for voice conversion systems, which still do not sound very natural, so don't be discouraged if the seq2seq solution doesn't work much better than the Mixture Neural Network solution.

Best regards!

HudsonHuang commented 7 years ago

@MissPassenger I found that Zaska is a DTW toolkit developed by the UPC, and that dtw is a DTW tool inside it. So I am trying to replace it with the mfcc and dtw commands from SPTK, like this:

```bash
b=2
sox mfcc/${DIR_REF}/${FILENAME}_sil.wav mfcc/${DIR_REF}/${FILENAME}_sil.raw
sox mfcc/${DIR_TST}/${FILENAME}_sil.wav mfcc/${DIR_TST}/${FILENAME}_sil.raw

x2x +sf < mfcc/${DIR_REF}/${FILENAME}_sil.raw | frame -l 480 -p 80 | \
    mfcc -l 480 -m 20 -s 16 > mfcc/${DIR_REF}/${FILENAME}.mfcc

x2x +sf < mfcc/${DIR_TST}/${FILENAME}_sil.raw | frame -l 480 -p 80 | \
    mfcc -l 480 -m 20 -s 16 > mfcc/${DIR_TST}/${FILENAME}.mfcc

dtw -l 480 mfcc/${DIR_REF}/${FILENAME}.mfcc < mfcc/${DIR_TST}/${FILENAME}.mfcc >> ${DIR_DTW}/${FILENAME}_ascii.dtw

x2x +af ${DIR_DTW}/${FILENAME}_ascii.dtw  ${DIR_DTW}/beam${b}/${FILENAME}.dtw
```

However, the dtw command outputs a format that neither the x2x command nor build_datatable can read. It seemed to be ASCII, so I used x2x +af to convert it, but that fails. Any idea? Thanks.
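For orientation, the alignment step discussed here is dynamic time warping between two sequences of feature vectors (for example, MFCC frames). The following is a minimal textbook sketch in Python, assuming a Euclidean frame distance and the three standard step directions. It is not the Zaska or SPTK implementation, and the function name `dtw_path` is hypothetical.

```python
import math


def dtw_path(ref, tst):
    """Align two sequences of feature vectors with plain DTW.

    Returns (total_cost, path), where path is a list of
    (ref_index, tst_index) pairs ordered from start to end.
    """
    n, m = len(ref), len(tst)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    def dist(a, b):
        # Euclidean distance between two frames (an assumption of this sketch).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    inf = float("inf")
    # cost[i][j] is the best accumulated cost aligning ref[:i] with tst[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], tst[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

The returned path is the list of frame pairs that an aligned data table would be built from. How the original scripts store that path on disk is not shown in this thread.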

albertaparicio commented 7 years ago

Regarding Zaska, I have contacted the people at UPC who developed this toolkit to ask if I can redistribute these programs, but I have had no answer.

Regarding the data formats, I am not aware of the format used by SPTK. In the case of the output of Zaska, the data was stored in float format with no headers, but I do not know how SPTK outputs the data.
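A headerless float file of the kind described above can be inspected with a few lines of Python. This is only a sketch: it assumes 32-bit little-endian floats and a known number of values per frame, neither of which is confirmed in this thread, and `read_raw_floats` is a hypothetical helper name.

```python
import struct


def read_raw_floats(path, dim):
    """Read a headerless file of float32 values and group them into frames.

    Assumes little-endian 32-bit floats; `dim` is the number of values
    per frame and must be known in advance, since the file has no header.
    """
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % 4 != 0:
        raise ValueError("file size is not a multiple of 4 bytes")
    count = len(data) // 4
    if count % dim != 0:
        raise ValueError("number of values is not a multiple of dim")
    values = struct.unpack("<%df" % count, data)
    # Split the flat list of values into consecutive frames of length dim.
    return [list(values[i:i + dim]) for i in range(0, count, dim)]
```

If the frame count or the values look implausible, the byte order or the assumed dimension is the first thing to check.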

I have checked in past commits, and if I am not mistaken, in this script I used the SPTK DTW. Maybe this can help:

https://github.com/albertaparicio/tfg-voice-conversion/blob/a4aeea2a244cf9f74ae3f03d4d7a6bb10c0a6594/data/training/align_training.sh

Albert


HudsonHuang commented 7 years ago

That helps a lot~ many thanks.