[UPDATE] This error seems to be related to https://github.com/tensorflow/models/issues/1736, where someone has solved it as follows:
"i changed 'sampled_loss(inputs, labels)' to 'sampled_loss(labels, logits): labels = tf.reshape(labels, [-1, 1]) return tf.nn.sampled_softmax_loss(w_t, b, labels, logits, num_samples, self.target_vocab_size)', the error gone."
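For clarity, here is a minimal sketch of what that fix amounts to. The names w_t, b, num_samples and target_vocab_size come from the quote above and are assumptions about the surrounding code, not the exact identifiers in seq2seq_model.py:

import tensorflow as tf

def make_sampled_loss(w_t, b, num_samples, target_vocab_size):
    # Newer TensorFlow versions call the loss as f(labels=..., logits=...),
    # so the wrapper must accept these two names, in this order.
    def sampled_loss(labels, logits):
        labels = tf.reshape(labels, [-1, 1])
        # In TF >= 1.0 the positional order is (weights, biases, labels, inputs, ...);
        # passing the arguments by name would also guard against further order changes.
        return tf.nn.sampled_softmax_loss(w_t, b, labels, logits,
                                          num_samples, target_vocab_size)
    return sampled_loss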
@nurtas-m I wonder: since training works with TF 1.5 and the provided cmudict example, is it possible that this is due to the dictionary format (for some reason)?
I have just realized that, while in the cmudict the CMU alphabet symbols are white-space separated, like:
'bout B AW1 T
'cause K AH0 Z
'course K AO1 R S
'cuse K Y UW1 Z
'em AH0 M
'frisco F R IH1 S K OW0
'gain G EH1 N
'kay K EY1
'm AH0 M
'n AH0 N
in my dictionary the IPA symbols are not separated in the same way:
a ʌ
a(1) ejˈ
a's ejˈz
a. ejˈ
a.'s ejˈz
a.s ejˈz
aaa tɹɪˌpʌlejˈ
aaberg ɑˈbɚg
aachen ɑˈkʌn
Of course, I'm not sure if this could be the issue.
Yes, you are absolutely right. The split_to_grapheme_phoneme() function in the data_utils.py module assumes that the phonemes in your input dictionary can be split by Python's split() function, so white space between phonemes is required. Conversely, no white space is needed between graphemes.
@nurtas-m thank you! So I should find a way to split the IPA phonemes into characters, maybe using the symbol list. Any hint?
Yes, you may just modify line 165 of the data_utils.py module to match your input dictionary format:
phonemes.append(list(split_line[1]))
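To illustrate the difference between the two splitting strategies, here is a small standalone sketch; split_to_grapheme_phoneme below is a simplified stand-in for the function in data_utils.py, not its actual implementation:

# -*- coding: utf-8 -*-

def split_to_grapheme_phoneme(line, split_phonemes=False):
    # Graphemes are never space-separated, so the word is split per character.
    split_line = line.strip().split()
    graphemes = list(split_line[0])
    if split_phonemes:
        # IPA-style column without spaces: one symbol per character.
        phonemes = list(u"".join(split_line[1:]))
    else:
        # CMUdict-style column: symbols are already white-space separated.
        phonemes = split_line[1:]
    return graphemes, phonemes

print(split_to_grapheme_phoneme(u"'bout B AW1 T"))
print(split_to_grapheme_phoneme(u"aachen ɑˈkʌn", split_phonemes=True))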
@nurtas-m hello, thanks. I made some changes to support at least this parameter via os.environ, so that one can optionally split the phonemes. I have also added the size param from the environment, so that you can do:
root@dbb58814d105:~# export g2p_size=512
root@dbb58814d105:~# export g2p_split_phonemes=False
You can check the changes here: https://github.com/loretoparisi/g2p-seq2seq/blob/master/g2p_seq2seq/app.py#L62
and in https://github.com/loretoparisi/g2p-seq2seq/blob/master/g2p_seq2seq/data_utils.py#L154
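Roughly, the environment overrides work like the following sketch (a simplification of my changes in app.py, with assumed default values, not the exact code):

import os

# Read the optional overrides from the environment (the defaults are assumptions).
size = int(os.environ.get("g2p_size", "64"))
split_phonemes = os.environ.get("g2p_split_phonemes", "False").lower() in ("1", "true", "yes")

print("layer size: %d, split phonemes: %s" % (size, split_phonemes))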
By the way, I am now trying the CMUDict with size=512, and apparently it works:
root@dbb58814d105:~# export g2p_size=512
root@dbb58814d105:~# export g2p_split_phonemes=False
root@dbb58814d105:~# g2p-seq2seq --train data/dict/cmudict/cmudict.dict --model data/models/cmudict-512 2>&1 >> ./train.log &
root@dbb58814d105:~#
root@dbb58814d105:~# tail -f train.log
LR decay factor: 0.99
Max gradient norm: 5.0
Batch size: 64
Size of layer: 512
Number of layers: 2
Steps per checkpoint: 200
Max steps: 0
Optimizer: sgd
Mode: g2p
But when using the CMU-IPA set, I get the error anyway:
root@dbb58814d105:~# export g2p_size=512
root@dbb58814d105:~# export g2p_split_phonemes=True
root@dbb58814d105:~# g2p-seq2seq --train data/dict/cmudict-ipa/cmudict-ipa.dict --model data/models/cmudict-ipa/ 2>&1 >> ./train.log
Traceback (most recent call last):
File "/usr/local/bin/g2p-seq2seq", line 11, in <module>
load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 92, in main
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 218, in create_train_model
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 179, in __prepare_model
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/seq2seq_model.py", line 177, in __init__
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1224, in model_with_buckets
softmax_loss_function=softmax_loss_function))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1137, in sequence_loss
softmax_loss_function=softmax_loss_function))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1092, in sequence_loss_by_example
crossent = softmax_loss_function(labels=target, logits=logit)
TypeError: sampled_loss() got an unexpected keyword argument 'logits'
Ok, thank you very much for your contribution! Can you, please, try to split the phonemes with white spaces in the CMU-IPA dictionary before you feed it to the program? And check that all the lines have at least one grapheme and one phoneme.
It seems that, because the order of the parameters of the tf.nn.sampled_softmax_loss() function changed (https://www.tensorflow.org/api_docs/python/tf/nn/sampled_softmax_loss), labels and inputs get mixed up. So, can you, please, add the names of the parameters when you call tf.nn.sampled_softmax_loss() in the seq2seq_model.py module, in the sampled_loss() function at row 115:
tf.nn.sampled_softmax_loss(weights=local_w_t, biases=local_b, inputs=local_inputs, labels=labels, num_sampled=num_samples, num_classes=self.target_vocab_size)
@nurtas-m thanks, still investigating. At this point I have switched to TF 1.6:
# python -c 'import tensorflow as tf; print(tf.__version__)'
1.6.0
and I was able to start training without the issue, so that change in the method signature seems to be the cause. Now I'm looking at the IPA dictionary's phoneme list, since as far as I know some symbols are made of two characters, like W', so the split expression should (or could) simply be a list of the IPA symbols.
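One possible way to split an unspaced IPA string against a known symbol inventory is a longest-match regex; the symbol list below is only a tiny illustrative subset covering the sample entries above, not the dictionary's actual inventory:

# -*- coding: utf-8 -*-
import re

# Tiny illustrative inventory; multi-character symbols (e.g. the "ej" diphthong)
# must be tried before their single-character prefixes.
IPA_SYMBOLS = [u"ej", u"ɑ", u"ʌ", u"ɚ", u"ɪ", u"b", u"g", u"k", u"n", u"z",
               u"t", u"ɹ", u"p", u"l", u"s", u"ˈ", u"ˌ"]

# Longest symbols first, so multi-character symbols win over their prefixes.
_pattern = re.compile(u"|".join(re.escape(s)
                                for s in sorted(IPA_SYMBOLS, key=len, reverse=True)))

def split_ipa(phonemes):
    # Note: characters not in the inventory are silently dropped by findall().
    return _pattern.findall(phonemes)

print(split_ipa(u"ɑˈkʌn"))   # [u'ɑ', u'ˈ', u'k', u'ʌ', u'n']
print(split_ipa(u"ejˈz"))    # [u'ej', u'ˈ', u'z']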
[UPDATE]
So with this setup and a layer size of 64 nodes, I apparently get no accuracy improvements:
No improvement over last 24 times. Training will stop after -16iterations if no improvement was seen.
global step 24000 learning rate 0.4090 step-time 0.15 perplexity 1.01
eval: perplexity 1.02
No improvement over last 25 times. Training will stop after -17iterations if no improvement was seen.
Training done.
Loading vocabularies from data/models/cmudict
Creating 2 layers of 64 units.
Reading model parameters from data/models/cmudict
Words: 13510
Errors: 13510
WER: 1.000
and the training setup was:
root@8932ca155955:~# g2p-seq2seq --train data/dict/cmudict/cmudict.dict --model data/models/cmudict 2>&1 >> train.log &
root@8932ca155955:~# tail -f train.log
LR decay factor: 0.99
Max gradient norm: 5.0
Batch size: 64
Size of layer: 64
Number of layers: 2
Steps per checkpoint: 200
Max steps: 0
Optimizer: sgd
Mode: g2p
Preparing G2P data
Loading vocabularies from data/models/cmudict
Reading development and training data.
Creating model with parameters:
Learning rate: 0.5
LR decay factor: 0.99
Max gradient norm: 5.0
Batch size: 64
Size of layer: 64
Number of layers: 2
Steps per checkpoint: 200
Max steps: 0
Optimizer: sgd
Mode: g2p
Are you sure you didn't modify the automatically created vocab.grapheme and vocab.phoneme files in the data/models/cmudict/ directory after training was started? Can you, please, check that these files contain all the expected symbols (one symbol, grapheme or phoneme, per line)? The first 4 rows in both files should be: _PAD _GO _EOS _UNK
Can you, please, write how many rows are in each file?
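A quick sanity check along those lines might look like this (the paths are taken from the training log above and assumed to be where the vocabularies were written):

import codecs

for path in ["data/models/cmudict/vocab.grapheme",
             "data/models/cmudict/vocab.phoneme"]:
    with codecs.open(path, "r", "utf-8") as vocab_file:
        symbols = [line.strip() for line in vocab_file]
    # The first four rows should be _PAD, _GO, _EOS, _UNK.
    print("%s: %d rows, first 4: %s" % (path, len(symbols), symbols[:4]))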
From the following log, it seems that training was done with good accuracy:
No improvement over last 24 times. Training will stop after -16iterations if no improvement was seen.
global step 24000 learning rate 0.4090 step-time 0.15 perplexity 1.01
eval: perplexity 1.02
No improvement over last 25 times. Training will stop after -17iterations if no improvement was seen.
Training done.
But, when you try to load the trained model, it seems that the symbols restored from these vocabularies got mixed up:
Loading vocabularies from data/models/cmudict
Creating 2 layers of 64 units.
Reading model parameters from data/models/cmudict
Words: 13510
Errors: 13510
WER: 1.000
@nurtas-m so, you are right, I was misreading the perplexity (indecision) metrics. Also, I think something went wrong while training with the 512 layer size, since after it finished I can see that the 64 layer size model works:
root@8932ca155955:~# echo "hello" | g2p-seq2seq --interactive --model data/models/cmudict/
Loading vocabularies from data/models/cmudict/
Creating 2 layers of 64 units.
Reading model parameters from data/models/cmudict/
> HH EH1 L OW0
while the 512 layer size model gives:
> root@8932ca155955:~# echo "hello" | g2p-seq2seq --interactive --model data/models/cmudict
cmudict/ cmudict-512/ cmudict-ipa-512/
root@8932ca155955:~# echo "hello" | g2p-seq2seq --interactive --model data/models/cmudict-512/
Loading vocabularies from data/models/cmudict-512/
Creating 2 layers of 512 units.
Reading model parameters from data/models/cmudict-512/
> H H
so there is something wrong with the 512 layer size model at this time.
If you want to try it out, I have just uploaded these models here. Both the model data and the dictionaries for CMU and CMU2IPA are there, and you can also find the created dictionaries (I didn't modify them manually).
Also, I'm currently training the CMU-IPA model with a 512 layer size; let's see what happens.
Ok, thank you! I think that you get the errors because you have Unicode symbols that are not separated from each other with white spaces. You may assume that after splitting the phonemes with the list() function you will get correctly split phonemes, but because of the Unicode encoding you may get results you had not expected. I recommend that you modify your dictionary to a format where the phonemes are separated from each other with white spaces.
>>> s1 = u"ababa(1) ɑˈbʌbʌ"
>>> s1_split = s1.split()
>>> print(s1_split)
[u'ababa(1)', u'\u0251\u02c8b\u028cb\u028c']
>>> phonemes1 = s1_split[1]
>>> list(phonemes1)
['\xc9', '\x91', '\xcb', '\x88', 'b', '\xca', '\x8c', 'b', '\xca', '\x8c']
>>> s2 = u"ababa(1) ɑ ˈ b ʌ b ʌ"
>>> s2_split = s2.split()
>>> print(s2_split)
[u'ababa(1)', u'\u0251', u'\u02c8', u'b', u'\u028c', u'b', u'\u028c']
>>> phonemes2 = s2_split[1:]
>>> print(phonemes2)
[u'\u0251', u'\u02c8', u'b', u'\u028c', u'b', u'\u028c']
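A minimal conversion sketch along those lines (the file names are placeholders, and it assumes one word followed by an unspaced IPA string per line, as in the examples above):

# -*- coding: utf-8 -*-
import codecs

# Rewrite "aachen ɑˈkʌn" as "aachen ɑ ˈ k ʌ n" (one symbol per column).
with codecs.open("cmudict-ipa.dict", "r", "utf-8") as fin, \
     codecs.open("cmudict-ipa.spaced.dict", "w", "utf-8") as fout:
    for line in fin:
        parts = line.strip().split()
        if len(parts) < 2:
            continue  # skip lines without at least one grapheme and one phoneme
        word, phonemes = parts[0], u"".join(parts[1:])
        fout.write(word + u" " + u" ".join(list(phonemes)) + u"\n")

This splits per code point; multi-character symbols would still need something like the symbol-list split sketched earlier.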
@nurtas-m yes, that was the issue, confirmed. In fact, this is the error I get after training from the CMU-IPA dictionary with the ordinary split:
~# echo "hello" | g2p-seq2seq --interactive --model data/models/cmudict-ipa-512/
Loading vocabularies from data/models/cmudict-ipa-512/
Creating 2 layers of 512 units.
Reading model parameters from data/models/cmudict-ipa-512/
> Traceback (most recent call last):
File "/usr/local/bin/g2p-seq2seq", line 11, in <module>
load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 105, in main
File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 427, in interactive
UnicodeEncodeError: 'ascii' codec can't encode character u'\u025b' in position 2: ordinal not in range(128)
I'm going to change the split function and/or update the dictionary, then.
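For what it's worth, the UnicodeEncodeError above is the usual Python 2 problem of printing a Unicode string to a non-UTF-8 stdout; a minimal sketch of the kind of workaround I have in mind (the decoded variable is hypothetical, standing in for whatever g2p.py prints in interactive mode):

# -*- coding: utf-8 -*-
import sys

decoded = u"h ɛ l oʊ"  # example IPA output containing non-ASCII symbols

if sys.version_info[0] == 2:
    # Encode explicitly so the ascii codec is never involved.
    sys.stdout.write(decoded.encode("utf-8") + b"\n")
else:
    print(decoded)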
Closing this and opening a new issue specific for the encoding issues, thank you.
I can train and run inference with the CMUDict model on TensorFlow 1.5 (and I suppose it will work with TF 1.6 as well; thanks for your help 💯). Now I'm trying to train a new model from scratch, based on a variant of the CMU dictionary where I have added a different symbol set as the second column and kept the same entries in the first column, but I get an error:
The input file has the same format as the cmudict/cmudict.dict file, and I can see that three files have been created in the output folder before the error: