Open loretoparisi opened 6 years ago
Hello, @loretoparisi. Which files can't you find in your dict folder? Do you mean that some of your dictionary files were there and were erased after training ended?
@nurtas-m so I'm missing the files generated in the data/dict folders when the training of my CMU-IPA dict ends. I only have the checkpoint model files:
:~/docker/g2p-seq2seq/data/dict/cmudict-ipa$ ls -l
.
..
~/docker/g2p-seq2seq/data/models/cmudict-ipa-256$ ls
checkpoint model.ckpt-148001.index model.ckpt-42001.data-00001-of-00002
eval model.ckpt-148001.meta model.ckpt-42001.index
eval.preprocessed model.ckpt-162001.data-00000-of-00002 model.ckpt-42001.meta
events.out.tfevents.1525880491.800125bf05cd model.ckpt-162001.data-00001-of-00002 model.ckpt-54001.data-00000-of-00002
graph.pbtxt model.ckpt-162001.index model.ckpt-54001.data-00001-of-00002
model.ckpt-108001.data-00000-of-00002 model.ckpt-162001.meta model.ckpt-54001.index
model.ckpt-108001.data-00001-of-00002 model.ckpt-176001.data-00000-of-00002 model.ckpt-54001.meta
model.ckpt-108001.index model.ckpt-176001.data-00001-of-00002 model.ckpt-68001.data-00000-of-00002
model.ckpt-108001.meta model.ckpt-176001.index model.ckpt-68001.data-00001-of-00002
model.ckpt-122001.data-00000-of-00002 model.ckpt-176001.meta model.ckpt-68001.index
model.ckpt-122001.data-00001-of-00002 model.ckpt-190001.data-00000-of-00002 model.ckpt-68001.meta
model.ckpt-122001.index model.ckpt-190001.data-00001-of-00002 model.ckpt-82001.data-00000-of-00002
model.ckpt-122001.meta model.ckpt-190001.index model.ckpt-82001.data-00001-of-00002
model.ckpt-136001.data-00000-of-00002 model.ckpt-190001.meta model.ckpt-82001.index
model.ckpt-136001.data-00001-of-00002 model.ckpt-200000.data-00000-of-00002 model.ckpt-82001.meta
model.ckpt-136001.index model.ckpt-200000.data-00001-of-00002 model.ckpt-96001.data-00000-of-00002
model.ckpt-136001.meta model.ckpt-200000.index model.ckpt-96001.data-00001-of-00002
model.ckpt-14001.data-00000-of-00002 model.ckpt-200000.meta model.ckpt-96001.index
model.ckpt-14001.data-00001-of-00002 model.ckpt-28001.data-00000-of-00002 model.ckpt-96001.meta
model.ckpt-14001.index model.ckpt-28001.data-00001-of-00002 model.params
model.ckpt-14001.meta model.ckpt-28001.index train.preprocessed
model.ckpt-148001.data-00000-of-00002 model.ckpt-28001.meta vocab.g2p
model.ckpt-148001.data-00001-of-00002 model.ckpt-42001.data-00000-of-00002
When you launch training, the program creates the model directory that you passed to the "--model_dir" flag (in your case, ~/docker/g2p-seq2seq/data/models/cmudict-ipa). If that directory already exists, the program checks whether it contains a "checkpoint" file. If it does, the program loads the model that the "checkpoint" file points to and continues training it. Otherwise, it creates new model, params, graph and vocab files:
model.ckpt-***
model.params
graph.pbtxt
vocab.g2p
The program also creates some auxiliary files that it uses only during training (such as "eval.preprocessed" and "train.preprocessed"). If you launch training with the "--reinit" flag, the program deletes the folder passed to "--model_dir" (if it exists) and creates a new folder with the same name.
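The start-up rules described above can be sketched roughly as follows. This is only an illustration of the described behaviour, not code taken from g2p-seq2seq; the helper name `prepare_model_dir` and its return values are hypothetical.

```python
import os
import shutil


def prepare_model_dir(model_dir, reinit=False):
    """Return "resume" or "fresh" following the rules described above.

    Hypothetical helper: it mirrors the described behaviour and is not
    part of the g2p-seq2seq code base.
    """
    if reinit and os.path.isdir(model_dir):
        # --reinit: delete the old folder so training starts from scratch.
        shutil.rmtree(model_dir)
    # Create the folder if it is missing (or was just removed).
    os.makedirs(model_dir, exist_ok=True)
    if os.path.isfile(os.path.join(model_dir, "checkpoint")):
        # A "checkpoint" file means an earlier model exists and is continued.
        return "resume"
    # No checkpoint: new model, params, vocab and graph files get written.
    return "fresh"
```

Under these assumptions, calling `prepare_model_dir("data/models/cmudict-ipa")` on a folder that already holds a "checkpoint" file would report "resume", which is why re-running training on the same folder does not start over unless "--reinit" is given.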
@nurtas-m ok thanks. One question: if the input dictionary is split into train and test sets, how do I run the evaluation on the test dictionary only? That is, is the WER reported by --evaluate computed on the training set or on the test set? Thank you.
If you have just one file and don't want to split it into train, development and test sets yourself, launch the program without the "--valid" and "--test" flags:
$ g2p-seq2seq --model_dir your/new/model/directory --train data_folder/cmudict.dict
The program will create additional files in your data folder, derived from the file passed to the "--train" flag:
cmudict.dict.part.train
cmudict.dict.part.dev
cmudict.dict.part.test
After training is over, you can pass the "cmudict.dict.part.test" file to the "--evaluate" flag to check the word error rate:
$ g2p-seq2seq --model_dir your/new/model/directory --evaluate data_folder/cmudict.dict.part.test
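For reference, a word-level error rate over a held-out test file can be sketched as below. This is the common definition (share of test words whose predicted pronunciation is wrong), written under the assumption that a prediction counts as correct if it matches any reference variant of the word; whether g2p-seq2seq counts variants exactly this way is not confirmed here, and the function name `word_error_rate` is hypothetical.

```python
from collections import defaultdict


def word_error_rate(reference_lines, predictions):
    """Fraction of reference words whose prediction matches no reference.

    reference_lines: iterable of "word PH1 PH2 ..." dictionary lines.
    predictions: dict mapping word -> predicted phoneme string.
    """
    references = defaultdict(set)
    for line in reference_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # Collect every reference pronunciation listed for the word.
        references[parts[0]].add(" ".join(parts[1:]))
    if not references:
        raise ValueError("no reference entries")
    # A missing prediction (None) also counts as an error.
    errors = sum(
        1 for word, prons in references.items()
        if predictions.get(word) not in prons
    )
    return errors / len(references)
```

The point of evaluating on the ".part.test" file is that none of its words were seen during training, so the resulting rate reflects generalization rather than memorization.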
Also, if you set the "--cleanup" flag during training, the program will create cleaned-up files without stress marks and comments:
cmudict.dict.part.train.cleanup
cmudict.dict.part.dev.cleanup
cmudict.dict.part.test.cleanup
So, if you have an initial dictionary file with the lines like:
ababa AH0 B AA1 B AH0
ababa(2) AA1 B AH0 B AH0
d'artagnan D AH0 R T AE1 NG Y AH0 N # foreign french
you will receive cleaned up files with the following lines:
ababa AH B AA B AH
ababa AA B AH B AH
d'artagnan D AH R T AE NG Y AH N
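The transformation shown in this example can be approximated with a few lines of Python. This is a minimal sketch that reproduces the three sample lines above; it is not the tool's own cleanup code, and the function name `cleanup_line` and the exact regular expressions are assumptions.

```python
import re


def cleanup_line(line):
    """Strip the comment, variant marker and stress digits from one entry."""
    # Drop a trailing "# comment", if any.
    line = line.split("#", 1)[0]
    parts = line.split()
    if not parts:
        return ""
    # "ababa(2)" -> "ababa": remove the pronunciation-variant marker.
    word = re.sub(r"\(\d+\)$", "", parts[0])
    # "AH0" -> "AH": remove trailing stress digits from each phoneme.
    phonemes = [re.sub(r"\d+$", "", p) for p in parts[1:]]
    return " ".join([word] + phonemes)


sample = [
    "ababa AH0 B AA1 B AH0",
    "ababa(2) AA1 B AH0 B AH0",
    "d'artagnan D AH0 R T AE1 NG Y AH0 N # foreign french",
]
for entry in sample:
    print(cleanup_line(entry))
```

Note that stress digits are an ARPAbet convention; for a dictionary with IPA pronunciations, such as the CMU-IPA one discussed here, this kind of cleanup would mainly affect variant markers and comments.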
@nurtas-m ok thanks a lot. The last problem is that when I train from scratch I cannot see the files
cmudict.dict.part.train
cmudict.dict.part.dev
cmudict.dict.part.test
in the data/dict/cmudict-ipa/
folder. I'm not specifying any flags apart from the default ones:
g2p-seq2seq --train data/dict/cmudict-ipa/cmudict-ipa.dict --model_dir data/models/cmudict-ipa
Which version of g2p-seq2seq do you have installed? We added the option of splitting the datasets in version 6.1.4a0.
I'm on the latest version (since I need the fix of the inference and frozen model stuff...).
I don't know why, in your case, the program doesn't split the initial dictionary into 3 datasets.
OK, could you please set the "--cleanup" flag during training:
$ g2p-seq2seq --model_dir data/models/cmudict-ipa --train data/dict/cmudict-ipa/cmudict-ipa.dict --cleanup
If this flag is active, the program will create 3 datasets from the initial dictionary even if the datasets already exist in the data folder:
*.part.train.cleanup
*.part.dev.cleanup
*.part.test.cleanup
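To confirm which of these split files are actually present next to the training dictionary, a small check like the following may help. It only looks for the file names mentioned in this thread; the helper name `find_split_files` is hypothetical and not part of g2p-seq2seq.

```python
import os


def find_split_files(train_path, cleanup=False):
    """Map each expected split file path to True if it exists on disk."""
    # With --cleanup, the split files carry an extra ".cleanup" suffix.
    suffix = ".cleanup" if cleanup else ""
    expected = [
        "{}.part.{}{}".format(train_path, part, suffix)
        for part in ("train", "dev", "test")
    ]
    return {path: os.path.isfile(path) for path in expected}


# Path taken from the command used earlier in this thread.
status = find_split_files("data/dict/cmudict-ipa/cmudict-ipa.dict", cleanup=True)
for path, found in status.items():
    print("found  " if found else "missing", path)
```

Running it from the same working directory as the training command matters, because the path passed to "--train" is relative.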
I trained the standard CMU Dict and got the dictionaries in the data/dict folder as expected. I then started training my CMU-to-IPA dict, which does seem to work and finishes with the testing as well.
But when I check the dict folder I cannot see those files; I just have my CMU-IPA dict file.