@nurtas-m waiting for your reply
abend(1) æ b e n d
The "(1)" here is wrong, I believe.
@nshmyrev basically I'm not sure if the problem is a TF string encoding problem or a problem in the dictionary parsing (I think not, since the file is UTF-8 encoded). Related to TF, there is an option to generically convert bytes to Unicode text using tf.compat.as_text(bytes_or_text, encoding='utf-8')
- https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/compat/as_text
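For reference, a quick sketch of what that helper does in TF 1.x (the byte string below is just an illustrative dictionary line, not taken from the actual file):

# Quick illustration of tf.compat.as_text / as_bytes (TF 1.x).
import tensorflow as tf

raw_line = b'abeln \xc3\xa6 b e l n'  # UTF-8 bytes, e.g. one line read from the dict file
text_line = tf.compat.as_text(raw_line, encoding='utf-8')      # -> u'abeln æ b e l n'
round_trip = tf.compat.as_bytes(text_line, encoding='utf-8')   # back to the original bytes

print(text_line)               # abeln æ b e l n
print(round_trip == raw_line)  # True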
@loretoparisi, I tried to reproduce the error that you described, but no errors occur during training or decoding. For training I applied the dictionary that you provided:
abell æ b e l
abella æ b e l ə
abeln æ b e l n
abelow æ b e l əʊ
abels æ b e l z
abelson æ b e l s n
abend æ b e n d
abend(1) æ b e n d
abendroth æ b e n d r əʊ θ
aber æ b ə
abercrombie æ b k r ɒ m ɪ
aberdeen æ b ə d i: n
aberford æ b ə f ə d
I checked out the old version of g2p-seq2seq that you use (commit 7532d74, Feb 4, 2018) and downgraded to tensorflow==1.0.0.
@nurtas-m that's interesting, so you have used the master of my fork here: https://github.com/loretoparisi/g2p-seq2seq.
According to your tests, it should work on that commit as well, right?
The only differences seem to be 1) TF version 1.6 and 2) OS version? I'm running via Docker on Ubuntu 16.04 LTS.
By the way, I'm updating my code base to match your latest commit and upgrading TF and Tensor2Tensor in my Dockerfile, so that we can exclude any other possible issues.
No, I just used the old version of g2p-seq2seq: https://github.com/cmusphinx/g2p-seq2seq/tree/7532d741ae2c0a736e77a4d71cc248c4fc9a8d1a
@nurtas-m thanks, I have moved to your master. I get this error now
[2018-04-19 16:14:45,483] Estimator's model_fn (<function wrapping_model_fn at 0x7f773ff066e0>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:Estimator's model_fn (<function wrapping_model_fn at 0x7f773ff066e0>) includes params argument, but params are not passed to Estimator.
[2018-04-19 16:14:45,483] Estimator's model_fn (<function wrapping_model_fn at 0x7f773ff066e0>) includes params argument, but params are not passed to Estimator.
Traceback (most recent call last):
File "/usr/local/bin/g2p-seq2seq", line 9, in
# g2p-seq2seq --train data/dict/cmudict-ipa/cmuipa-parsed.txt --model data/models/cmudict-ipa-64 2>&1 >> train.log &
[1] 55
root@59568a01c788:~# INFO:tensorflow:Importing user module g2p_seq2seq from path /usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg
[2018-04-19 16:05:28,649] Importing user module g2p_seq2seq from path /usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg
INFO:tensorflow:Overriding hparams in transformer_base with eval_drop_long_sequences=1,batch_size=4096,num_hidden_layers=2,hidden_size=64,filter_size=256,num_heads=4,length_bucket_step=1.5,max_length=30,min_length_bucket=6
[2018-04-19 16:05:28,653] Overriding hparams in transformer_base with eval_drop_long_sequences=1,batch_size=4096,num_hidden_layers=2,hidden_size=64,filter_size=256,num_heads=4,length_bucket_step=1.5,max_length=30,min_length_bucket=6
INFO:tensorflow:schedule=train_and_evaluate
[2018-04-19 16:05:28,653] schedule=train_and_evaluate
INFO:tensorflow:worker_gpu=1
[2018-04-19 16:05:28,653] worker_gpu=1
INFO:tensorflow:sync=False
[2018-04-19 16:05:28,653] sync=False
WARNING:tensorflow:Schedule=train_and_evaluate. Assuming that training is running on a single machine.
[2018-04-19 16:05:28,653] Schedule=train_and_evaluate. Assuming that training is running on a single machine.
INFO:tensorflow:datashard_devices: ['gpu:0']
[2018-04-19 16:05:28,654] datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
[2018-04-19 16:05:28,654] caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0']
[2018-04-19 16:05:28,654] ps_devices: ['gpu:0']
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 1, '_task_type': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb037e6c410>, '_keep_checkpoint_every_n_hours': 1, '_session_config': gpu_options {
per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
optimizer_options {
}
}
, 'use_tpu': False, '_tf_random_seed': None, '_num_worker_replicas': 0, '_task_id': 0, 't2t_device_info': {'num_async_replicas': 1}, '_evaluation_master': '', '_log_step_count_steps': 100, '_num_ps_replicas': 0, '_is_chief': True, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
, '_save_checkpoints_steps': 2000, '_environment': 'local', '_master': '', '_model_dir': 'data/models/cmudict-ipa-64', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7fb03dfb0110>, '_save_summary_steps': 100}
[2018-04-19 16:05:28,655] Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 1, '_task_type': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb037e6c410>, '_keep_checkpoint_every_n_hours': 1, '_session_config': gpu_options {
per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
optimizer_options {
}
}
, 'use_tpu': False, '_tf_random_seed': None, '_num_worker_replicas': 0, '_task_id': 0, 't2t_device_info': {'num_async_replicas': 1}, '_evaluation_master': '', '_log_step_count_steps': 100, '_num_ps_replicas': 0, '_is_chief': True, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
, '_save_checkpoints_steps': 2000, '_environment': 'local', '_master': '', '_model_dir': 'data/models/cmudict-ipa-64', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7fb03dfb0110>, '_save_summary_steps': 100}
WARNING:tensorflow:Estimator's model_fn (<function wrapping_model_fn at 0x7fb03dfaf578>) includes params argument, but params are not passed to Estimator.
[2018-04-19 16:05:28,655] Estimator's model_fn (<function wrapping_model_fn at 0x7fb03dfaf578>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using ValidationMonitor
[2018-04-19 16:05:28,655] Using ValidationMonitor
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:267: __init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
[2018-04-19 16:05:28,835] From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py:267: __init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 1, '_task_type': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb037e6c410>, '_keep_checkpoint_every_n_hours': 1, '_session_config': gpu_options {
per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
optimizer_options {
}
}
, 'use_tpu': False, '_tf_random_seed': None, '_num_worker_replicas': 0, '_task_id': 0, 't2t_device_info': {'num_async_replicas': 1}, '_evaluation_master': '', '_log_step_count_steps': 100, '_num_ps_replicas': 0, '_is_chief': True, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
, '_save_checkpoints_steps': 2000, '_environment': 'local', '_master': '', '_model_dir': 'data/models/cmudict-ipa-64', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7fb03dfb0110>, '_save_summary_steps': 100}
[2018-04-19 16:05:28,835] Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 1, '_task_type': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb037e6c410>, '_keep_checkpoint_every_n_hours': 1, '_session_config': gpu_options {
per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
optimizer_options {
}
}
, 'use_tpu': False, '_tf_random_seed': None, '_num_worker_replicas': 0, '_task_id': 0, 't2t_device_info': {'num_async_replicas': 1}, '_evaluation_master': '', '_log_step_count_steps': 100, '_num_ps_replicas': 0, '_is_chief': True, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
, '_save_checkpoints_steps': 2000, '_environment': 'local', '_master': '', '_model_dir': 'data/models/cmudict-ipa-64', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7fb03dfb0110>, '_save_summary_steps': 100}
WARNING:tensorflow:Estimator's model_fn (<function wrapping_model_fn at 0x7fb037e066e0>) includes params argument, but params are not passed to Estimator.
[2018-04-19 16:05:28,835] Estimator's model_fn (<function wrapping_model_fn at 0x7fb037e066e0>) includes params argument, but params are not passed to Estimator.
Traceback (most recent call last):
File "/usr/local/bin/g2p-seq2seq", line 9, in <module>
load_entry_point('g2p-seq2seq==6.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg/g2p_seq2seq/app.py", line 101, in main
g2p_model.prepare_datafiles(train_path=FLAGS.train, dev_path=FLAGS.valid)
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg/g2p_seq2seq/g2p.py", line 87, in prepare_datafiles
self.problem.generate_preprocess_data(train_path, dev_path)
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg/g2p_seq2seq/g2p_problem.py", line 140, in generate_preprocess_data
dev_preprocess_path)
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg/g2p_seq2seq/g2p_problem.py", line 341, in generate_files
for case in train_gen:
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg/g2p_seq2seq/g2p_problem.py", line 178, in tabbed_generator
assert len(items) > 1
AssertionError
I'm using TensorFlow 1.4.0 (since I cannot upgrade to CUDA 9.0 at this time) and the latest Tensor2Tensor.
@loretoparisi, "--model" flag is reserved by tensor2tensor program, so we constrained to rename this flag to "--model_dir". Can you, please, change the flag name?
@nurtas-m ok! Replaced; now the flag seems to be ok but I get
$g2p-seq2seq --train data/dict/cmudict-ipa/cmuipa-parsed.txt --model_dir data/models/cmudict-ipa-64 2>&1 >> train.log &
, '_save_checkpoints_steps': 2000, '_environment': 'local', '_master': '', '_model_dir': 'data/models/cmudict-ipa-64', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7fe303070110>, '_save_summary_steps': 100}
WARNING:tensorflow:Estimator's model_fn (<function wrapping_model_fn at 0x7fe2fcec66e0>) includes params argument, but params are not passed to Estimator.
[2018-04-19 16:25:12,409] Estimator's model_fn (<function wrapping_model_fn at 0x7fe2fcec66e0>) includes params argument, but params are not passed to Estimator.
Traceback (most recent call last):
File "/usr/local/bin/g2p-seq2seq", line 9, in <module>
load_entry_point('g2p-seq2seq==6.0.0a0', 'console_scripts', 'g2p-seq2seq')()
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg/g2p_seq2seq/app.py", line 101, in main
g2p_model.prepare_datafiles(train_path=FLAGS.train, dev_path=FLAGS.valid)
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg/g2p_seq2seq/g2p.py", line 87, in prepare_datafiles
self.problem.generate_preprocess_data(train_path, dev_path)
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg/g2p_seq2seq/g2p_problem.py", line 140, in generate_preprocess_data
dev_preprocess_path)
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg/g2p_seq2seq/g2p_problem.py", line 341, in generate_files
for case in train_gen:
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.0.0a0-py2.7.egg/g2p_seq2seq/g2p_problem.py", line 178, in tabbed_generator
assert len(items) > 1
AssertionError
@loretoparisi, please clone the latest version of g2p-seq2seq (6.1.0a0). I tried to launch it with the following versions of the programs and it works fine: tensorflow=1.4.0, tensor2tensor=1.5.7, g2p-seq2seq=1.6.1. But I have to warn you that "--interactive" mode is still crashing :(( We are working on it now.
Ok, we have fixed the "--interactive" mode. Can you please confirm whether you succeeded in launching the program?
@nurtas-m thanks! I'm trying right now with the latest versions.
@nurtas-m So, this is what I did:
1) changed the setup.py to match tensorflow 1.4.0, because I cannot use CUDA 9.0 right now;
2) then installed
pip install tensorflow-gpu==1.4.0 && \
pip install tensor2tensor==1.5.7
and your master codebase (see the Dockerfile here).
Then I ran:
g2p-seq2seq --train data/dict/cmudict-ipa/cmuipa-parsed.txt --model_dir data/models/cmudict-ipa-64 2>&1 >> train.log &
And I get the same error as before.
I then tried changing the dict to the standard CMU ARPAbet, and I get
$ g2p-seq2seq --train data/dict/cmudict/cmudict.dict --model_dir data/models/cmudict-64 2>&1 >> train.log &
WARNING:tensorflow:Estimator's model_fn (<function wrapping_model_fn at 0x7fcc730fd668>) includes params argument, but params are not passed to Estimator.
[2018-04-20 12:34:13,744] Estimator's model_fn (<function wrapping_model_fn at 0x7fcc730fd668>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Reading data files from data/dict/cmudict/train
[2018-04-20 12:34:23,955] Reading data files from data/dict/cmudict/train
tensorflow.python.framework.errors_impl.NotFoundError: data/dict/cmudict/train; No such file or directory
What is the train file?
@loretoparisi,
What is the train file?
It's a preprocessed train dictionary file. This file is utilized during training.
I have fixed this problem now. Please download the latest version of g2p-seq2seq (ver. 6.1.1a0) and try to retrain.
@nurtas-m ok, thanks a lot 👍, that error is now fixed. The only issue left is the previous one:
root@139975b2cf92:~# g2p-seq2seq --train data/dict/cmudict-ipa/cmuipa-parsed.txt --model_dir data/models/cmudict-ipa-64 2>&1 >> train.log &
Traceback (most recent call last):
File "/usr/local/bin/g2p-seq2seq", line 9, in <module>
load_entry_point('g2p-seq2seq==6.1.0a0', 'console_scripts', 'g2p-seq2seq')()
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.1.0a0-py2.7.egg/g2p_seq2seq/app.py", line 101, in main
g2p_model.prepare_datafiles(train_path=FLAGS.train, dev_path=FLAGS.valid)
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.1.0a0-py2.7.egg/g2p_seq2seq/g2p.py", line 87, in prepare_datafiles
self.problem.generate_preprocess_data(train_path, dev_path)
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.1.0a0-py2.7.egg/g2p_seq2seq/g2p_problem.py", line 135, in generate_preprocess_data
eval_preprocess_path)
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.1.0a0-py2.7.egg/g2p_seq2seq/g2p_problem.py", line 336, in generate_files
for case in train_gen:
File "/usr/local/lib/python2.7/dist-packages/g2p_seq2seq-6.1.0a0-py2.7.egg/g2p_seq2seq/g2p_problem.py", line 173, in tabbed_generator
assert len(items) > 1
AssertionError
The only difference that I see is that I'm using TF 1.4.0. The failing assert is this one: https://github.com/cmusphinx/g2p-seq2seq/blob/master/g2p_seq2seq/g2p_problem.py#L173
@nurtas-m SOLVED IT FINALLY!!!! 🚔 😀
INFO:tensorflow:Base learning rate: 0.200000
[2018-04-20 15:24:47,887] Base learning rate: 0.200000
INFO:tensorflow:Trainable Variables Total size: 261696
[2018-04-20 15:24:47,897] Trainable Variables Total size: 261696
INFO:tensorflow:Using optimizer Adam
[2018-04-20 15:24:47,898] Using optimizer Adam
INFO:tensorflow:Create CheckpointSaverHook.
[2018-04-20 15:24:51,558] Create CheckpointSaverHook.
2018-04-20 15:24:54.241250: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2018-04-20 15:24:54.319765: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-20 15:24:54.320070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.91GiB
2018-04-20 15:24:54.320107: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0)
It was due to missing values in the phoneme column, so I had the AssertionError on len(items) > 1.
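For clarity, a tiny sketch of how a line with a missing phoneme column trips that assert (the split logic here is only illustrative, not the actual g2p_problem.py code):

# -*- coding: utf-8 -*-
# A dictionary line without phonemes yields a single item, so the assert fires.
good_line = u"abascal ə b ɑ: s k l"
bad_line = u"abascal"                 # phoneme column missing

for line in (good_line, bad_line):
    items = line.split(None, 1)       # word, then the rest of the line as phonemes
    assert len(items) > 1             # AssertionError raised on the bad line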
Thanks a lot, closing this. I will update you with the results of this CMU-IPA model!
@nurtas-m add a check and provide a meaningful error message in this case.
@nshmyrev good idea, in fact it was difficult to catch this, just 2-3 missing values in 100K rows! Thanks!
@nurtas-m make it a warning instead of an error and ignore the wrong line. Fix English.
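A minimal sketch of what that skip-and-warn behavior could look like in a tabbed_generator-style loop (function and variable names here are illustrative, not the actual g2p_seq2seq code):

import logging

def tolerant_tabbed_generator(source_path):
    """Yield (word, phonemes) pairs, warning about and skipping malformed lines."""
    with open(source_path) as source_file:
        for line_num, line in enumerate(source_file, start=1):
            items = line.strip().split(None, 1)
            if len(items) < 2:
                logging.warning("Skipping line %d in %s: missing phoneme column: %r",
                                line_num, source_path, line.strip())
                continue
            yield items[0], items[1]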
@nurtas-m Hello, I have prepared a new CMU vocabulary with IPA-alphabet space-separated symbols, like
and I have tried a first training with 64 nodes / layer:
while training I have got
and, for example, at inference I have got encoding errors for words that have unicode,
like
so I supposed I also need to handle unicode chars in the phonemes column, like:
We have this issue in both the encoder and the decoder, so the training completes, but it does not work properly. Specifically, the error should be here while decoding:
In fact, as soon as the word has a unicode char it breaks, like for the word abascal ə b ɑ: s k l:
while for words starting with ascii chars it is ok, but the decoder is broken internally, since it does not output the next chars starting from the unicode one, like for the word lambreau l æ m b r əʊ əʊ
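As a small illustration of the byte/Unicode handling that seems to be missing on the decoder side (purely a sketch; it assumes the decoded symbols come back as raw UTF-8 bytes, which may not match what tensor2tensor returns internally):

# -*- coding: utf-8 -*-
# A two-byte symbol such as "ə" (b'\xc9\x99') breaks if each byte is treated as a
# separate character; decoding the joined UTF-8 bytes keeps the full pronunciation.
decoded_symbols = [b'\xc9\x99', b'b', b'\xc9\x91:', b's', b'k', b'l']  # "ə b ɑ: s k l"

pronunciation = b' '.join(decoded_symbols).decode('utf-8')
print(pronunciation)  # ə b ɑ: s k l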