cmusphinx / g2p-seq2seq

G2P with Tensorflow

TypeError: sampled_loss() got an unexpected keyword argument 'logits' #104

Closed loretoparisi closed 6 years ago

loretoparisi commented 6 years ago

I can train and run inference with the CMUDict model on TensorFlow 1.5 (and I suppose it will work with TF 1.6 as well - thanks for your help 💯). Now I'm trying to train a new model from scratch, based on a variant of the CMU dictionary where I have added a different symbol set as the second column and kept the same entries in the first column, but I get an error:

  File "/usr/local/bin/g2p-seq2seq", line 11, in <module>
    load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 83, in main
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 144, in prepare_data
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/data_utils.py", line 250, in prepare_g2p_data
  File "/usr/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 17] File exists: 'data/models/cmudict-test/'
Preparing G2P data
Creating vocabulary data/models/cmudict-test/vocab.phoneme
Creating vocabulary data/models/cmudict-test/vocab.grapheme
Reading development and training data.
Creating model with parameters:
Learning rate:        0.5
LR decay factor:      0.99
Max gradient norm:    5.0
Batch size:           64
Size of layer:        64
Number of layers:     2
Steps per checkpoint: 200
Max steps:            0
Optimizer:            sgd
Mode:                 g2p

Traceback (most recent call last):
  File "/usr/local/bin/g2p-seq2seq", line 11, in <module>
    load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 87, in main
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 218, in create_train_model
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 179, in __prepare_model
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/seq2seq_model.py", line 177, in __init__
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1224, in model_with_buckets
    softmax_loss_function=softmax_loss_function))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1137, in sequence_loss
    softmax_loss_function=softmax_loss_function))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1092, in sequence_loss_by_example
    crossent = softmax_loss_function(labels=target, logits=logit)
TypeError: sampled_loss() got an unexpected keyword argument 'logits'

The input file has the same format as the cmudict/cmudict.dict file, and I can see that three files have been created in the output folder before the error:

# ls -l data/models/cmudict-test/
total 1264
-rw-r--r-- 1 root root      20 Mar 27 09:37 model.params
-rw-r--r-- 1 root root     103 Mar 27 09:37 vocab.grapheme
-rw-r--r-- 1 root root 1285528 Mar 27 09:37 vocab.phoneme
loretoparisi commented 6 years ago

[UPDATE] This error seems to be related to https://github.com/tensorflow/models/issues/1736

and someone has solved it as follows:

"I changed sampled_loss(inputs, labels) to sampled_loss(labels, logits):
labels = tf.reshape(labels, [-1, 1])
return tf.nn.sampled_softmax_loss(w_t, b, labels, logits, num_samples, self.target_vocab_size)
and the error is gone."

loretoparisi commented 6 years ago

@nurtas-m I wonder: since training works with TF 1.5 and the provided cmudict example, is it possible that this is due to the dictionary format (for some reason)? I have just realized that in cmudict the CMU alphabet symbols are separated by white spaces, like:

'bout B AW1 T
'cause K AH0 Z
'course K AO1 R S
'cuse K Y UW1 Z
'em AH0 M
'frisco F R IH1 S K OW0
'gain G EH1 N
'kay K EY1
'm AH0 M
'n AH0 N

while in my dictionary the IPA symbols are not separated in the same way:

a ʌ
a(1) ejˈ
a's ejˈz
a. ejˈ
a.'s ejˈz
a.s ejˈz
aaa tɹɪˌpʌlejˈ
aaberg ɑˈbɚg
aachen ɑˈkʌn

Of course, I'm not sure if this could be the issue.

nurtas-m commented 6 years ago

Yes, you are absolutely right. The split_to_grapheme_phoneme() function in the data_utils.py module assumes that the phonemes in your input dictionary can be split by Python's split() function, so white spaces are required between phonemes. Conversely, no white spaces are needed between graphemes.

loretoparisi commented 6 years ago

@nurtas-m thank you! So I should find a way to split the IPA phonemes into chars, maybe using the symbol list, any hint?

nurtas-m commented 6 years ago

Yes, you may just modify line 165 in the data_utils.py module to match your input dictionary format:

phonemes.append(list(split_line[1]))
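
Roughly, the effect of that change is the following (a sketch only; split_dictionary_line and the split_phonemes flag are illustrative names, and only the list(split_line[1]) call comes from data_utils.py):

def split_dictionary_line(line, split_phonemes=False):
    # 'line' is assumed to already be a unicode string (decode the file as UTF-8 first).
    split_line = line.strip().split()
    graphemes = list(split_line[0])        # graphemes are split per character
    if split_phonemes:
        phonemes = list(split_line[1])     # u"ɑˈkʌn" -> [u"ɑ", u"ˈ", u"k", u"ʌ", u"n"]
    else:
        phonemes = split_line[1:]          # CMU format: phonemes already space-separated
    return graphemes, phonemes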

loretoparisi commented 6 years ago

@nurtas-m hello, thanks, I made some changes to support at least this parameter from os.environ, so that one can optionally split phonemes. I have also added the size param from the environment, so that you can do:

root@dbb58814d105:~# export g2p_size=512
root@dbb58814d105:~# export g2p_split_phonemes=False

You can check the changes here: https://github.com/loretoparisi/g2p-seq2seq/blob/master/g2p_seq2seq/app.py#L62

and in https://github.com/loretoparisi/g2p-seq2seq/blob/master/g2p_seq2seq/data_utils.py#L154
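
Roughly, the environment handling amounts to something like this (a sketch; only the variable names g2p_size and g2p_split_phonemes come from the exports above, while the defaults of 64 and False are assumptions):

import os

# Optional overrides read from the environment (the defaults here are assumptions).
size = int(os.environ.get("g2p_size", "64"))
split_phonemes = os.environ.get("g2p_split_phonemes", "False").lower() == "true"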

By the way, I am now trying the CMUDict with size=512, and apparently it works:

root@dbb58814d105:~# export g2p_size=512
root@dbb58814d105:~# export g2p_split_phonemes=False
root@dbb58814d105:~# g2p-seq2seq --train data/dict/cmudict/cmudict.dict --model data/models/cmudict-512 2>&1 >> ./train.log &
root@dbb58814d105:~# 
root@dbb58814d105:~# tail -f train.log 
LR decay factor:      0.99
Max gradient norm:    5.0
Batch size:           64
Size of layer:        512
Number of layers:     2
Steps per checkpoint: 200
Max steps:            0
Optimizer:            sgd
Mode:                 g2p

But when using the CMU-IPA set, I get the error anyway:

root@dbb58814d105:~# export g2p_size=512
root@dbb58814d105:~# export g2p_split_phonemes=True
root@dbb58814d105:~#  g2p-seq2seq --train data/dict/cmudict-ipa/cmudict-ipa.dict --model data/models/cmudict-ipa/ 2>&1 >> ./train.log
Traceback (most recent call last):
  File "/usr/local/bin/g2p-seq2seq", line 11, in <module>
    load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 92, in main
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 218, in create_train_model
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 179, in __prepare_model
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/seq2seq_model.py", line 177, in __init__
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1224, in model_with_buckets
    softmax_loss_function=softmax_loss_function))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1137, in sequence_loss
    softmax_loss_function=softmax_loss_function))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1092, in sequence_loss_by_example
    crossent = softmax_loss_function(labels=target, logits=logit)
TypeError: sampled_loss() got an unexpected keyword argument 'logits'
nurtas-m commented 6 years ago

Ok, thank you very much for your contribution! Can you please try to split the phonemes with white spaces in the CMU-IPA dictionary before you feed it to the program? And check that all lines have at least one grapheme and one phoneme. It seems that, because of the change in parameter order of the tf.nn.sampled_softmax_loss() function (https://www.tensorflow.org/api_docs/python/tf/nn/sampled_softmax_loss), labels and inputs get mixed up. So, can you please add the parameter names when you call tf.nn.sampled_softmax_loss() in the seq2seq_model.py module, in the sampled_loss() function at row 115:

tf.nn.sampled_softmax_loss(weights=local_w_t, biases=local_b, inputs=local_inputs, labels=labels, num_sampled=num_samples, num_classes=self.target_vocab_size)
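
For reference, a minimal sketch of such a closure with the keyword arguments spelled out (the make_sampled_loss wrapper and the plain variable names are illustrative, not the exact seq2seq_model.py code; it assumes the pre-projection decoder outputs are what gets passed to the sampled softmax):

import tensorflow as tf

def make_sampled_loss(local_w_t, local_b, num_samples, target_vocab_size):
    # Returns a loss function with the (labels, logits) signature that
    # legacy_seq2seq.sequence_loss_by_example uses in TF >= 1.0.
    def sampled_loss(labels, logits):
        labels = tf.reshape(labels, [-1, 1])
        return tf.nn.sampled_softmax_loss(
            weights=local_w_t,
            biases=local_b,
            labels=labels,
            inputs=logits,            # pre-projection decoder outputs
            num_sampled=num_samples,
            num_classes=target_vocab_size)
    return sampled_loss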

loretoparisi commented 6 years ago

@nurtas-m thanks, still investigating; at this time I have switched to TF 1.6:

# python -c 'import tensorflow as tf; print(tf.__version__)'
1.6.0

and I was able to start training without the issue, so that change in the method signature seems to have been the cause. Now I'm looking at the IPA dictionary's phoneme list, since as far as I know some symbols are made of two chars (like W'), so the split expression should (or could) simply be driven by a list of the IPA symbols.
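
One possible sketch of such symbol-aware splitting, assuming you have the inventory of multi-character IPA symbols actually used in the dictionary (the IPA_SYMBOLS list below is purely illustrative):

# -*- coding: utf-8 -*-
import re

# Purely illustrative inventory; replace with the real symbol list of the dictionary.
IPA_SYMBOLS = [u"tʃ", u"dʒ", u"oʊ", u"ɑ", u"ʌ", u"ɚ", u"ˈ", u"ˌ"]

# Longest symbols first, so multi-character symbols win over their prefixes.
_IPA_RE = re.compile(u"|".join(
    re.escape(s) for s in sorted(IPA_SYMBOLS, key=len, reverse=True)))

def split_ipa(transcription):
    # u"ɑˈbɚg" -> [u"ɑ", u"ˈ", u"b", u"ɚ", u"g"]; characters not in the
    # inventory fall through one by one instead of being dropped.
    symbols = []
    pos = 0
    while pos < len(transcription):
        match = _IPA_RE.match(transcription, pos)
        if match:
            symbols.append(match.group(0))
            pos = match.end()
        else:
            symbols.append(transcription[pos])
            pos += 1
    return symbols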

loretoparisi commented 6 years ago

[UPDATE]

So with this setup and a layer size of 64 nodes, I apparently get no accuracy improvement:

No improvement over last 24 times. Training will stop after -16iterations if no improvement was seen.
global step 24000 learning rate 0.4090 step-time 0.15 perplexity 1.01
  eval: perplexity 1.02
No improvement over last 25 times. Training will stop after -17iterations if no improvement was seen.
Training done.
Loading vocabularies from data/models/cmudict
Creating 2 layers of 64 units.
Reading model parameters from data/models/cmudict
Words: 13510
Errors: 13510
WER: 1.000

and the training setup was:

root@8932ca155955:~# g2p-seq2seq --train data/dict/cmudict/cmudict.dict --model data/models/cmudict 2>&1 >> train.log &
root@8932ca155955:~# tail -f train.log 
LR decay factor:      0.99
Max gradient norm:    5.0
Batch size:           64
Size of layer:        64
Number of layers:     2
Steps per checkpoint: 200
Max steps:            0
Optimizer:            sgd
Mode:                 g2p

Preparing G2P data
Loading vocabularies from data/models/cmudict
Reading development and training data.
Creating model with parameters:
Learning rate:        0.5
LR decay factor:      0.99
Max gradient norm:    5.0
Batch size:           64
Size of layer:        64
Number of layers:     2
Steps per checkpoint: 200
Max steps:            0
Optimizer:            sgd
Mode:                 g2p
nurtas-m commented 6 years ago

Are you sure you didn't modify the automatically created vocab.grapheme and vocab.phoneme files in the data/models/cmudict/ directory after training started? Can you please check that these files contain all the expected symbols (one symbol, grapheme or phoneme, per line)? The first 4 rows in both files should be: _PAD _GO _EOS _UNK

Can you also please write how many rows each file has?

From the following log, it seems that training finished with good accuracy:

No improvement over last 24 times. Training will stop after -16iterations if no improvement was seen.
global step 24000 learning rate 0.4090 step-time 0.15 perplexity 1.01
  eval: perplexity 1.02
No improvement over last 25 times. Training will stop after -17iterations if no improvement was seen.
Training done.

But when you try to load the trained model, it seems that the symbols restored from these vocabularies got mixed up:

Loading vocabularies from data/models/cmudict
Creating 2 layers of 64 units.
Reading model parameters from data/models/cmudict
Words: 13510
Errors: 13510
WER: 1.000
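
A quick way to check both points could be something like this (a sketch; the paths follow this thread and the expected header is the one mentioned above):

# Sanity check of the generated vocabularies (paths as used in this thread).
for name in ("vocab.grapheme", "vocab.phoneme"):
    path = "data/models/cmudict/" + name
    with open(path) as vocab_file:
        rows = [row.rstrip("\n") for row in vocab_file]
    # Expect _PAD, _GO, _EOS, _UNK as the first four rows.
    print("%s: %d rows, header: %s" % (name, len(rows), rows[:4]))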

loretoparisi commented 6 years ago

@nurtas-m so you are right, I misread the perplexity metrics. Also, I think something happened while training the 512 layer size model, since after it ended I can see that

the 64 layer size model works:

root@8932ca155955:~# echo "hello" | g2p-seq2seq --interactive --model data/models/cmudict/
Loading vocabularies from data/models/cmudict/
Creating 2 layers of 64 units.
Reading model parameters from data/models/cmudict/
> HH EH1 L OW0

while the 512 layer size model gives:

> root@8932ca155955:~# echo "hello" | g2p-seq2seq --interactive --model data/models/cmudict 
cmudict/         cmudict-512/     cmudict-ipa-512/ 
root@8932ca155955:~# echo "hello" | g2p-seq2seq --interactive --model data/models/cmudict-512/
Loading vocabularies from data/models/cmudict-512/
Creating 2 layers of 512 units.
Reading model parameters from data/models/cmudict-512/
> H H

so there is something wrong with the 512 layer size model at this time.

If you want to try them out, I have just uploaded these models here. You can see that both the model data and the dictionaries for CMU and CMU2IPA are there. You can also find the created dictionaries (I didn't modify them manually).

Also, I'm currently training the CMU-IPA with 512 layer size, let's see what happens.

nurtas-m commented 6 years ago

Ok, thank you! I think you get the errors because you have unicode symbols that are not separated from each other by white spaces. You might suppose that after splitting the phonemes with the list() function you will get correctly split phonemes, but because of the unicode encoding you may get results you hadn't expected. I recommend modifying your dictionary into the format where the phonemes are separated from each other by white spaces:

>>> s1 = u"ababa(1) ɑˈbʌbʌ"
>>> s1_split = s1.split()
>>> print(s1_split)
[u'ababa(1)', u'\u0251\u02c8b\u028cb\u028c']
>>> phonemes1 = s1_split[1]
>>> list(phonemes1)
['\xc9', '\x91', '\xcb', '\x88', 'b', '\xca', '\x8c', 'b', '\xca', '\x8c']
>>> s2 = u"ababa(1) ɑ ˈ b ʌ b ʌ"
>>> s2_split = s2.split()
>>> print(s2_split)
[u'ababa(1)', u'\u0251', u'\u02c8', u'b', u'\u028c', u'b', u'\u028c']
>>> phonemes2 = s2_split[1:]
>>> print(phonemes2)
[u'\u0251', u'\u02c8', u'b', u'\u028c', u'b', u'\u028c']
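
If it helps, a one-off conversion could look like this (a sketch; the file names are placeholders, and it assumes one character per phoneme, so multi-character symbols would need the symbol-list based split discussed above):

# -*- coding: utf-8 -*-
import codecs

# One-off conversion sketch: rewrite the dictionary so that phonemes are
# separated by white spaces. Assumes one character per phoneme.
with codecs.open("cmudict-ipa.dict", encoding="utf-8") as src, \
     codecs.open("cmudict-ipa-spaced.dict", "w", encoding="utf-8") as dst:
    for line in src:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue                      # skip lines without word + transcription
        word, transcription = parts
        dst.write(word + u" " + u" ".join(list(transcription)) + u"\n")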
loretoparisi commented 6 years ago

@nurtas-m yes, confirmed, that was the issue. In fact, this is the error I get after training on the CMU-IPA dictionary with the ordinary split:

~# echo "hello" | g2p-seq2seq --interactive --model data/models/cmudict-ipa-512/
Loading vocabularies from data/models/cmudict-ipa-512/
Creating 2 layers of 512 units.
Reading model parameters from data/models/cmudict-ipa-512/
> Traceback (most recent call last):
  File "/usr/local/bin/g2p-seq2seq", line 11, in <module>
    load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 105, in main
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 427, in interactive
UnicodeEncodeError: 'ascii' codec can't encode character u'\u025b' in position 2: ordinal not in range(128)

I'm going to change the split function and/or update the dictionary then.
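
For what it's worth, that interactive-mode crash looks like the usual Python 2 problem of printing unicode to an ASCII stdout; a minimal illustration of the workaround, outside of g2p-seq2seq itself and with made-up output:

# -*- coding: utf-8 -*-
# Python 2: encode unicode explicitly before printing instead of relying on
# the default ASCII codec, which raises UnicodeEncodeError for IPA symbols.
phonemes = [u"ɛ", u"l", u"oʊ"]            # made-up decoder output
print(u" ".join(phonemes).encode("utf-8"))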

loretoparisi commented 6 years ago

Closing this and opening a new issue specific for the encoding issues, thank you.