cmusphinx / g2p-seq2seq

G2P with Tensorflow

The train and dev dict files are not generated always? #122

Open loretoparisi opened 6 years ago

loretoparisi commented 6 years ago

I trained on the standard CMU Dict, and I get the dictionaries in the data/dict folder as expected:

root@800125bf05cd:~# ls data/dict/cmudict/
cmudict.dict  cmudict.dict.dev.preprocessed  cmudict.dict.train.preprocessed  cmudict.phones  cmudict.symbols  cmudict.vp

I then started training my CMU-to-IPA dict, which does work in some fashion and finishes with the evaluation as well:

g2p-seq2seq --evaluate data/dict/cmudict-ipa/cmudict-ipa.dict --model_dir data/models/cmudict-ipa
...
[2018-05-09 15:13:06,164] Inference results OUTPUT: z ɪ u: g ə ə əʊ v  z
INFO:tensorflow:Inference results INPUT: zywicki
[2018-05-09 15:13:06,165] Inference results INPUT: zywicki
INFO:tensorflow:Inference results OUTPUT: z ɪ w ɪ k ɪ
[2018-05-09 15:13:06,165] Inference results OUTPUT: z ɪ w ɪ k ɪ
Words: 133286
Errors: 487
WER: 0.004
Accuracy: 0.996

But when I check the dict folder for that training, I cannot see those files; I just have my CMU-IPA dict file.

nurtas-m commented 6 years ago

Hello, @loretoparisi Which files can't you find in your dict folder? Do you mean that some of your dictionary files were there, and after the training ended some of them were erased?

loretoparisi commented 6 years ago

@nurtas-m so I'm missing the files generated in the data/dict folders when the training of my CMU-IPA dict ends. I have only the checkpoint model files:

:~/docker/g2p-seq2seq/data/dict/cmudict-ipa$ ls -l
.
..
~/docker/g2p-seq2seq/data/models/cmudict-ipa-256$ ls
checkpoint                                   model.ckpt-148001.index                model.ckpt-42001.data-00001-of-00002
eval                                         model.ckpt-148001.meta                 model.ckpt-42001.index
eval.preprocessed                            model.ckpt-162001.data-00000-of-00002  model.ckpt-42001.meta
events.out.tfevents.1525880491.800125bf05cd  model.ckpt-162001.data-00001-of-00002  model.ckpt-54001.data-00000-of-00002
graph.pbtxt                                  model.ckpt-162001.index                model.ckpt-54001.data-00001-of-00002
model.ckpt-108001.data-00000-of-00002        model.ckpt-162001.meta                 model.ckpt-54001.index
model.ckpt-108001.data-00001-of-00002        model.ckpt-176001.data-00000-of-00002  model.ckpt-54001.meta
model.ckpt-108001.index                      model.ckpt-176001.data-00001-of-00002  model.ckpt-68001.data-00000-of-00002
model.ckpt-108001.meta                       model.ckpt-176001.index                model.ckpt-68001.data-00001-of-00002
model.ckpt-122001.data-00000-of-00002        model.ckpt-176001.meta                 model.ckpt-68001.index
model.ckpt-122001.data-00001-of-00002        model.ckpt-190001.data-00000-of-00002  model.ckpt-68001.meta
model.ckpt-122001.index                      model.ckpt-190001.data-00001-of-00002  model.ckpt-82001.data-00000-of-00002
model.ckpt-122001.meta                       model.ckpt-190001.index                model.ckpt-82001.data-00001-of-00002
model.ckpt-136001.data-00000-of-00002        model.ckpt-190001.meta                 model.ckpt-82001.index
model.ckpt-136001.data-00001-of-00002        model.ckpt-200000.data-00000-of-00002  model.ckpt-82001.meta
model.ckpt-136001.index                      model.ckpt-200000.data-00001-of-00002  model.ckpt-96001.data-00000-of-00002
model.ckpt-136001.meta                       model.ckpt-200000.index                model.ckpt-96001.data-00001-of-00002
model.ckpt-14001.data-00000-of-00002         model.ckpt-200000.meta                 model.ckpt-96001.index
model.ckpt-14001.data-00001-of-00002         model.ckpt-28001.data-00000-of-00002   model.ckpt-96001.meta
model.ckpt-14001.index                       model.ckpt-28001.data-00001-of-00002   model.params
model.ckpt-14001.meta                        model.ckpt-28001.index                 train.preprocessed
model.ckpt-148001.data-00000-of-00002        model.ckpt-28001.meta                  vocab.g2p
model.ckpt-148001.data-00001-of-00002        model.ckpt-42001.data-00000-of-00002
nurtas-m commented 6 years ago

When you launch the training mode, the program creates the model directory that you set with the "--model_dir" flag (in your case, the ~/docker/g2p-seq2seq/data/models/cmudict-ipa directory). If the directory you set with "--model_dir" already exists, the program checks whether there is a "checkpoint" file in that folder. If the "checkpoint" file is found, the program loads the model pointed to in the "checkpoint" file and continues training it. Otherwise, it creates new model, param, vocab and graph files: model.ckpt-***, model.params, graph.pbtxt, vocab.g2p. The program also creates some auxiliary files that it uses only during training (like "eval.preprocessed" and "train.preprocessed"). If you launch the training mode with the "--reinit" flag, the program deletes the folder you set with "--model_dir" (if it exists) and creates a new folder with the same name.
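
For instance, a rough sketch of the three behaviors, reusing your paths (the commands are just illustrations):

# Fresh --model_dir: creates model.ckpt-***, model.params, graph.pbtxt, vocab.g2p
$ g2p-seq2seq --train data/dict/cmudict-ipa/cmudict-ipa.dict --model_dir data/models/cmudict-ipa
# Same --model_dir again: a "checkpoint" file is found, so training resumes from it
$ g2p-seq2seq --train data/dict/cmudict-ipa/cmudict-ipa.dict --model_dir data/models/cmudict-ipa
# --reinit: deletes the existing folder and trains a new model from scratch
$ g2p-seq2seq --train data/dict/cmudict-ipa/cmudict-ipa.dict --model_dir data/models/cmudict-ipa --reinit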

loretoparisi commented 6 years ago

@nurtas-m ok, thanks. Question: if the input dictionary is split into train and test sets, how do I run the evaluation on the test dictionary only? I.e., is the WER reported by --evaluate computed on the training set or on the test set? Thank you.

nurtas-m commented 6 years ago

If you have just one file and don't want to split it into train, development and test sets yourself, launch the program without the "--valid" and "--test" flags:

$ g2p-seq2seq --model_dir your/new/model/directory --train data_folder/cmudict.dict

The program will create in your data folder additional files obtained from the file that you set with the "--train" flag:

cmudict.dict.part.train
cmudict.dict.part.dev
cmudict.dict.part.test
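
If you would rather control the split yourself, you can instead pass your own files with those flags; a minimal sketch, with hypothetical file names:

# my.train, my.dev and my.test are hypothetical pre-split files
$ g2p-seq2seq --model_dir your/new/model/directory --train data_folder/my.train --valid data_folder/my.dev --test data_folder/my.test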

After the training is over, you can set up "cmudict.dict.part.test" file to the "--evaluate" flag and check for word error rate:

$ g2p-seq2seq --model_dir your/new/model/directory --evaluate data_folder/cmudict.dict.part.test

nurtas-m commented 6 years ago

Also, if you set the "--cleanup" flag during training, the program will create cleaned-up files without stress marks and comments:

cmudict.dict.part.train.cleanup
cmudict.dict.part.dev.cleanup
cmudict.dict.part.test.cleanup

So, if you have an initial dictionary file with lines like:

ababa AH0 B AA1 B AH0
ababa(2) AA1 B AH0 B AH0
d'artagnan D AH0 R T AE1 NG Y AH0 N # foreign french

you will receive cleaned-up files with the following lines:

ababa AH B AA B AH
ababa AA B AH B AH
d'artagnan D AH R T AE NG Y AH N
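
In other words, the cleanup roughly amounts to stripping the stress digits, the "(n)" variant markers, and the "#" comment tails. A minimal sed sketch of that transformation (not the tool's actual implementation):

# drop "(n)" markers, "#" comment tails, and the stress digit after each phone
$ sed -e 's/([0-9]*)//' -e 's/ *#.*$//' -e 's/\([A-Z]\)[0-9]/\1/g' cmudict.dict.part.train > cmudict.dict.part.train.cleanup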

loretoparisi commented 6 years ago

@nurtas-m ok, thanks a lot. The last problem is that when I train from scratch I cannot see the files

cmudict.dict.part.train
cmudict.dict.part.dev
cmudict.dict.part.test

in the data/dict/cmudict-ipa/ folder. I'm not specifying any flags apart from the default ones:

g2p-seq2seq --train data/dict/cmudict-ipa/cmudict-ipa.dict --model_dir data/models/cmudict-ipa

nurtas-m commented 6 years ago

What version of g2p-seq2seq do you have installed? We added the possibility of splitting the datasets starting from version 6.1.4a0.
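
For instance, assuming the package was installed with pip, you can check the version with:

$ pip show g2p-seq2seq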

loretoparisi commented 6 years ago

I'm on the latest version (since I need the fix for the inference and frozen-model stuff...).

nurtas-m commented 6 years ago

I don't know why, in your case, the program doesn't split the initial dictionary into 3 datasets. Can you please set the "--cleanup" flag during training:

$ g2p-seq2seq --model_dir data/models/cmudict-ipa --train data/dict/cmudict-ipa/cmudict-ipa.dict --cleanup

If this flag is active, the program will create 3 datasets from the initial dictionary even if the datasets already exist in the data folder:

*.part.train.cleanup
*.part.dev.cleanup
*.part.test.cleanup
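
After that run finishes, you can check that the split files appeared next to your dictionary, e.g.:

$ ls data/dict/cmudict-ipa/ | grep part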