cmusphinx / g2p-seq2seq

G2P with Tensorflow
Other
669 stars 195 forks

Reg: Error during seq2seq model training. #133

Open ellurunaresh opened 6 years ago

ellurunaresh commented 6 years ago

Hi all, while training the model I get the following error. I followed earlier blog posts but couldn't solve the issue. My vocabulary is in ASCII format, so I am not sure why I am getting this error. Please help me solve it. TensorFlow version: 1.3.0

Traceback (most recent call last):
  File "/usr/local/bin/g2p-seq2seq", line 11, in <module>
    load_entry_point('g2p-seq2seq==5.0.0a0', 'console_scripts', 'g2p-seq2seq')()
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/app.py", line 77, in main
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 198, in create_train_model
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/g2p.py", line 170, in prepare_model
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/seq2seq_model.py", line 178, in __init__
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1195, in model_with_buckets
    softmax_loss_function=softmax_loss_function))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1110, in sequence_loss
    softmax_loss_function=softmax_loss_function))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py", line 1067, in sequence_loss_by_example
    crossent = softmax_loss_function(target, logit)
  File "build/bdist.linux-x86_64/egg/g2p_seq2seq/seq2seq_model.py", line 117, in sampled_loss
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_impl.py", line 1191, in sampled_softmax_loss
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_impl.py", line 947, in _compute_sampled_logits
    range_max=num_classes)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/candidate_sampling_ops.py", line 134, in log_uniform_candidate_sampler
    seed2=seed2, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_candidate_sampling_ops.py", line 357, in _log_uniform_candidate_sampler
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2397, in create_op
    set_shapes_for_outputs(ret)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1757, in set_shapes_for_outputs
    shapes = shape_func(op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1707, in call_with_requiring
    return call_cpp_shape_fn(op, require_shape_fn=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.py", line 610, in call_cpp_shape_fn
    debug_python_shape_fn, require_shape_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.py", line 675, in _call_cpp_shape_fn_impl
    raise ValueError(err.message)
ValueError: Shape must be rank 2 but is rank 1 for 'model_with_buckets/sequence_loss/sequence_loss_by_example/sampled_softmax_loss/LogUniformCandidateSampler' (op: 'LogUniformCandidateSampler') with input shapes: [?]

nurtas-m commented 6 years ago

Hello, @ellurunaresh. Please clone the latest version of g2p-seq2seq (6.2.0a0). It also requires tensorflow>=1.5.0.

ellurunaresh commented 6 years ago

Actually, I can't update TensorFlow on my system. Can I solve this problem without upgrading?



nurtas-m commented 6 years ago

In that case, could you install tensorflow 1.5.0 for your user only, using the "--user" flag: pip install tensorflow==1.5.0 --user
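After a per-user install it can be worth confirming which TensorFlow the interpreter actually picks up. The following is only a minimal sketch: `version_tuple`, `meets_minimum` and `installed_tensorflow_version` are hypothetical helper names (not part of g2p-seq2seq or TensorFlow), and the 1.5.0 minimum is taken from the comment above.

```python
import re


def version_tuple(version):
    """Turn '1.5.0' (or '1.5.0rc1') into a comparable tuple such as (1, 5, 0)."""
    numbers = re.findall(r"\d+", version)[:3]
    return tuple(int(n) for n in numbers)


def meets_minimum(installed, minimum="1.5.0"):
    """Return True if `installed` is at least `minimum` (numeric comparison)."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_tensorflow_version():
    """Return (version, module path), or None if tensorflow cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.__version__, tf.__file__


if __name__ == "__main__":
    found = installed_tensorflow_version()
    if found is None:
        print("tensorflow is not importable from this interpreter")
    else:
        version, path = found
        print("tensorflow {} loaded from {}".format(version, path))
        print("meets the 1.5.0 minimum: {}".format(meets_minimum(version)))
```

Comparing tuples of integers rather than raw strings avoids ordering "1.10.0" before "1.5.0". The printed module path shows whether the per-user copy or the system-wide one was imported.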

ellurunaresh commented 6 years ago

OK sure. Thanks 😊



ellurunaresh commented 6 years ago

Hi, I am training a character-to-word-sequence model using the g2p approach, with a large vocabulary for this experiment. New words were added at test time; these entries do not exist in vocab.phoneme, so I get "UNK" for the unknown words.

1) How do I handle "_UNK" during decoding? Is there an option to set a parameter so that it picks the nearest string?
2) During training, can I generate "embeddings" for all unknown words?

Please help me work out how to proceed.

ellurunaresh commented 6 years ago

Please let me know how to handle this issue.

ellurunaresh commented 6 years ago

If anybody knows the solution please share it.

nurtas-m commented 6 years ago

Hello, @ellurunaresh

  1. How to handle "_UNK" during decoding. Is there any option to set the parameter so that it could take any nearest string?
If you are working on word boundary detection, as I mentioned in issue #126, you don't need to consider any decoded symbols except the "SPACE" symbol. The only information you need is the position of each "SPACE" symbol. For example, say you feed the program the following input sequence for decoding:

    > goodafternoon

And, let's say, you receive the following decoded sequence with "UNK" symbols:

    decodes = ["g", "o", "o", "UNK", "SPACE", "a", "v", "t", "UNK", "r", "n", "o", "e", "n"]

You should take just the positions of the "SPACE" symbols in the decoded sequence:

    space_positions = [sym_pos for sym_pos, sym in enumerate(decodes) if sym == 'SPACE']

In the above example, the "SPACE" symbol occurs at index 4 of decodes (counting from zero):

    print(space_positions)
    [4]

So, you should build the output sequence from the input sequence (not from the decoded sequence with "UNK" and other decoded symbols), and just add a white-space character at the positions where a "SPACE" symbol was found:

    inputs = list("goodafternoon")  # the original input sequence
    output_str = ""
    for pos, sym in enumerate(inputs):
        if pos in space_positions:
            output_str += " "
        output_str += sym
    print("Input:{}".format("".join(inputs)))
    print("Output:{}".format(output_str))
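The steps above can be collected into one function. This is a minimal sketch, assuming the decoder emits one symbol per input character plus the "SPACE" symbols; `insert_spaces` is a hypothetical name, not a g2p-seq2seq function. Each "SPACE" in the decoded sequence is shifted right by the number of "SPACE" symbols before it, so the sketch subtracts that count. This matters as soon as more than one boundary is decoded.

```python
def insert_spaces(inputs, decodes, space_symbol="SPACE"):
    """Re-insert word boundaries into `inputs` using only the positions of
    `space_symbol` in `decodes`; every other decoded symbol is ignored."""
    # Indices in `inputs` before which a space should be inserted.
    boundaries = set()
    spaces_seen = 0
    for pos, sym in enumerate(decodes):
        if sym == space_symbol:
            # Earlier SPACE symbols occupy slots in `decodes` but not in
            # `inputs`, so shift the position back by how many were seen.
            boundaries.add(pos - spaces_seen)
            spaces_seen += 1

    pieces = []
    for pos, ch in enumerate(inputs):
        if pos in boundaries:
            pieces.append(" ")
        pieces.append(ch)
    return "".join(pieces)


decodes = ["g", "o", "o", "UNK", "SPACE", "a", "v", "t", "UNK",
           "r", "n", "o", "e", "n"]
print(insert_spaces("goodafternoon", decodes))  # good afternoon
```

Because the output is built from the input characters, "UNK" and mis-decoded symbols such as "v" never reach the result.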

  2. During training can I generate "embeddings" for all unknown words?

Generating and using embeddings outside of tensor2tensor is problematic, because it builds its vocabularies from sub-tokens as well as tokens: https://github.com/tensorflow/tensor2tensor/issues/173