NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Questions about Transfer Learning. #355

Open RichardsonLiao opened 5 years ago

RichardsonLiao commented 5 years ago

Hello everyone, we have been trying transfer learning recently, and here are a few things we have found:

  1. When we use transfer learning (via the “load_model” parameter in the config file), the program stalls for a long time after the “assign_ops” function (which restores variables) has finished, and memory usage is very high. This does not happen in normal training or when using the “continue_learning” parameter.

  2. When using transfer learning (we modified the code to force restoring two variables: “bn/moving_mean” and “bn/moving_variance”), the training loss is low, as expected, only at step 0 and explodes afterwards. This happens even if we set the learning rate to 0 and use the training data of the pre-trained model. The issue only occurs when “dtype” is “mixed”; training works normally when it is set to tf.float32. Here are the combinations we have tried:

Only the second and third ones work normally.

  3. We set the “print_loss_steps” parameter in the config file to 1 and observed the following. At step 0 the program prints 11 different training losses; at step 1 it prints 7. From step 2 onwards each step prints a single training loss, as we expected. Note that we use Horovod and “num_gpus” is set to 8.

Configuration:

    "random_seed": 0,
    "use_horovod": True,
    "num_epochs": 1,

    "num_gpus": 8,
    "batch_size_per_gpu": 16,
    "iter_size": 1,

    "save_summaries_steps": 100,
    "print_loss_steps": 1,
    "print_samples_steps": 2200,
    "eval_steps": 2200,
    "save_checkpoint_steps": 1100,
    "num_checkpoints": 1,

Training information:

*** Epoch 0, global step 0: ***     Train loss: 3.5081 
time per step = 0:24:47.373
***     Sample WER: 0.6000
***     Sample target:     when ashur natsir pal died
***     Sample prediction: when ashur natzsir polodied
*** Epoch 0, global step 0: ***     Train loss: 4.2053 
time per step = 0:00:20.477
*** Epoch 0, global step 0: ***     Train loss: 3.2743 
time per step = 0:00:7.077
*** Epoch 0, global step 0: ***     Train loss: 3.7765 
time per step = 0:00:6.780
*** Epoch 0, global step 0: ***     Train loss: 1.9787 
time per step = 0:00:7.466
*** Epoch 0, global step 0: ***     Train loss: 2.4200 
time per step = 0:00:8.330
*** Epoch 0, global step 0: ***     Train loss: 3.4003 
time per step = 0:00:7.882
*** Epoch 0, global step 0: ***     Train loss: 3.1898 
time per step = 0:00:7.498
*** Epoch 0, global step 0: ***     Train loss: 4.4040 
time per step = 0:00:9.195
*** Epoch 0, global step 0: ***     Train loss: 2.0181 
time per step = 0:00:10.196
*** Epoch 0, global step 0: ***     Train loss: 1.5686 
time per step = 0:00:9.415
*** Epoch 0, global step 0: ***     Train loss: 2.5663 
time per step = 0:00:9.227
*** Epoch 0, global step 1: ***     Train loss: 1261.3043 
time per step = 0:00:9.849
*** Epoch 0, global step 1: ***     Train loss: 1218.6698 
time per step = 0:00:9.150
*** Epoch 0, global step 1: ***     Train loss: 1317.1223 
time per step = 0:00:9.792
*** Epoch 0, global step 1: ***     Train loss: 1286.2400 
time per step = 0:00:8.314
*** Epoch 0, global step 1: ***     Train loss: 1181.7028 
time per step = 0:00:9.601
*** Epoch 0, global step 1: ***     Train loss: 1253.2593 
time per step = 0:00:9.329
*** Epoch 0, global step 1: ***     Train loss: 1203.3721 
time per step = 0:00:10.711
*** Epoch 0, global step 2: ***     Train loss: 1210.5000 
time per step = 0:00:9.804
*** Epoch 0, global step 3: ***     Train loss: 1108.2490 
time per step = 0:00:9.214
*** Epoch 0, global step 4: ***     Train loss: 1072.8215 
time per step = 0:00:8.263

  4. When we use transfer learning and run with Horovod, the program reports that the variables under “Loss_Optimization” cannot be loaded.

Our conclusion is that transfer learning only works with a dtype of tf.float32. Can someone help us explain this behavior? Thanks a lot!
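For anyone trying to reproduce the observation about several loss lines being printed for the same global step, a small script along these lines can tally them from a saved log. This is only a sketch: the regular expression assumes the log format shown in the excerpt above, and `losses_per_step` and `sample_log` are names made up for this illustration, not part of OpenSeq2Seq.

```python
import re

# Matches lines such as:
# "*** Epoch 0, global step 1: ***     Train loss: 1261.3043"
LOSS_LINE = re.compile(r"global step (\d+): \*\*\*\s+Train loss: ([0-9.]+)")


def losses_per_step(log_text):
    """Return {global_step: [loss, ...]} in the order the lines appear."""
    per_step = {}
    for line in log_text.splitlines():
        match = LOSS_LINE.search(line)
        if match:
            step = int(match.group(1))
            per_step.setdefault(step, []).append(float(match.group(2)))
    return per_step


# A shortened, hypothetical excerpt in the same format as the log above.
sample_log = "\n".join([
    "*** Epoch 0, global step 0: ***     Train loss: 3.5081",
    "time per step = 0:24:47.373",
    "*** Epoch 0, global step 0: ***     Train loss: 4.2053",
    "time per step = 0:00:20.477",
    "*** Epoch 0, global step 1: ***     Train loss: 1261.3043",
    "time per step = 0:00:9.849",
    "*** Epoch 0, global step 2: ***     Train loss: 1210.5000",
])

per_step = losses_per_step(sample_log)
counts = {step: len(values) for step, values in per_step.items()}
print(counts)  # {0: 2, 1: 1, 2: 1}
```

Running the same function over the full training output gives the number of loss lines per step, which makes it easier to compare runs with different “num_gpus” or “print_loss_steps” settings.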

borisgin commented 5 years ago

Can you attach the complete logs for mixed precision, please?

RichardsonLiao commented 5 years ago

Thanks for replying!

Here is the training log of the pre-trained model, which we trained with mixed precision:

[[7269,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: c08cb0a9b3b6

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from scratch
*** Training config:
{'batch_size_per_gpu': 2,
 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
                       'input_type': 'logfbank',
                       'max_duration': 16.7,
                       'num_audio_features': 64,
                       'shuffle': True,
                       'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
 'decoder_params': {'alpha': 2.0,
                    'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
                    'beam_width': 512,
                    'beta': 1.5,
                    'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
                    'initializer': <function xavier_initializer at 0x7f4105f5bae8>,
                    'lm_path': 'language_model/4-gram.binary',
                    'trie_path': 'language_model/trie.binary',
                    'use_language_model': False},
 'dtype': 'mixed',
 'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
 'encoder_params': {'activation_fn': <function <lambda> at 0x7f411bbfa7b8>,
                    'convnet_layers': [{'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [2],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [13],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [17],
                                        'num_channels': 96,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [21],
                                        'num_channels': 160,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [25],
                                        'num_channels': 128,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [2],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [29],
                                        'num_channels': 192,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [1],
                                        'num_channels': 256,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'}],
                    'data_format': 'channels_last',
                    'dropout_keep_prob': 0.7,
                    'initializer': <function xavier_initializer at 0x7f4105f5bae8>,
                    'initializer_params': {'uniform': False},
                    'normalization': 'batch_norm'},
 'eval_steps': 50,
 'iter_size': 1,
 'larc_params': {'larc_eta': 0.001},
 'load_model': '',
 'logdir': 'w2ltestmp',
 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
 'loss_params': {},
 'lr_policy': <function poly_decay at 0x7f4100115378>,
 'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
 'num_checkpoints': 1,
 'num_epochs': 600,
 'num_gpus': 2,
 'optimizer': 'Momentum',
 'optimizer_params': {'momentum': 0.9},
 'print_loss_steps': 80,
 'print_samples_steps': 80,
 'random_seed': 0,
 'regularizer': <function l2_regularizer at 0x7f4105edfe18>,
 'regularizer_params': {'scale': 0.001},
 'save_checkpoint_steps': 50,
 'save_summaries_steps': 10,
 'summaries': ['learning_rate',
               'variables',
               'gradients',
               'larc_summaries',
               'variable_norm',
               'gradient_norm',
               'global_gradient_norm'],
 'use_horovod': True}
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
*** Trainable variables:
***   ForwardPass/w2l_encoder/conv11/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/kernel:0
***     shape: (13, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/kernel:0
***     shape: (17, 64, 96), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/gamma:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/beta:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/kernel:0
***     shape: (21, 96, 160), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/gamma:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/beta:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/kernel:0
***     shape: (25, 160, 128), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/gamma:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/beta:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/kernel:0
***     shape: (29, 128, 192), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/gamma:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/beta:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/kernel:0
***     shape: (1, 192, 256), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/gamma:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/beta:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
***     shape: (256, 29), <dtype: 'float16_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
***     shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
2019-02-13 01:03:41.034278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:03:41.034341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
2019-02-13 01:03:41.618036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:03:41.618100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 01:03:41.624068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:03:41.624093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      1 
2019-02-13 01:03:41.624118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   N 
2019-02-13 01:03:41.625121: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 01:03:42.339535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:03:42.339595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-02-13 01:03:42.339622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-02-13 01:03:42.340540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
[c08cb0a9b3b6:58974] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:58974] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: ***     Train loss: 946.1537 
time per step = 0:00:0.110
***     Sample WER: 4.3333
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: tom h vmv tdsqmi bhxmqditmqydq pmztcf fmx fh  djdm m' yc z  vhv mscvtmpxuhfhqm u  tdhdvpdepdn k u'ephmpdym e vkhcf tziahxmdh dj mnhphusyv'tqma'jq qmtv itda a' vqtpa ' vei gkd th qu r dxv hptqjotmptdkqdnt jtvtipq odtc dhvh t  hqpsimtqahyd xstm m'ilx'klqpvhid 'qyt' tq htv q'jmqjc'tqde dliqdtq  tmjmbgvc jivtjeuheavmcvsqymdaphqhtdrqnkdh fxudk ncqpdqz snapcvctrbedctd
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 40, global step 80: ***     Train loss: 177.6312 
time per step = 0:00:0.116
***     Sample WER: 1.0000
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: e a a e
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 80, global step 160: ***     Train loss: 99.9032 
time per step = 0:00:0.082
***     Sample WER: 1.0000
***     Sample target:     there was no autopsy period
***     Sample prediction:  nuup
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 120, global step 240: ***     Train loss: 110.6862 
time per step = 0:00:0.075
***     Sample WER: 0.9167
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: heutti  a llt ooneyut i stt neggh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 160, global step 320: ***     Train loss: 77.5739 
time per step = 0:00:0.080
***     Sample WER: 0.9500
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: ats flull f meenadn n  usiess suts s rrsh rrm  woorooo rhappppy ooau  wwere nnot anuunccmmon s sight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 200, global step 400: ***     Train loss: 114.4641 
time per step = 0:00:0.078
***     Sample WER: 1.0000
***     Sample target:     volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
***     Sample prediction: oue  t ne  y to echnge alaleneoudred tan t  y o onn gh looa
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 240, global step 480: ***     Train loss: 55.4402 
time per step = 0:00:0.077
***     Sample WER: 0.6667
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: hhaeptt  in  a lootoff money buu iss that eenouggh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 280, global step 560: ***     Train loss: 67.9414 
time per step = 0:00:0.081
***     Sample WER: 0.5833
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: teyaee ut iin a llot o of money but is that eenuughh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 320, global step 640: ***     Train loss: 43.4279 
time per step = 0:00:0.077
***     Sample WER: 0.5000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: boats ful of men and women n busiins suuits fresh rom work or hpy ourr weere n not an ncommon sight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 360, global step 720: ***     Train loss: 43.1768 
time per step = 0:00:0.081
***     Sample WER: 0.7727
***     Sample target:     mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
***     Sample prediction: mistterr jones will ovveresesee the coman's  opereatig uunitsas wewel ass thhe company's rearcch auctilvitities andsaaaf supporttserice  the ccompany said
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 400, global step 800: ***     Train loss: 59.6981 
time per step = 0:00:0.085
***     Sample WER: 0.9091
***     Sample target:     mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
***     Sample prediction: thte oynnes wilovesee t he compan's' oppertting  nitgs sas swewlll as s te companys reseaarch tiv tites nd sstaff support seics the ccommpany said
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 440, global step 880: ***     Train loss: 11.7136 
time per step = 0:00:0.081
***     Sample WER: 0.4167
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: thhy  havve pput in a lot of money buh is that enoughh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 480, global step 960: ***     Train loss: 17.9923 
time per step = 0:00:0.089
***     Sample WER: 0.2500
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: baats fulll of meen and women in buusines suits fresh from work or hchappy hour were not an uncommon sight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 520, global step 1040: ***     Train loss: 11.5907 
time per step = 0:00:0.076
***     Sample WER: 0.4444
***     Sample target:     quote there aren't any financial irregularities unquote he says
***     Sample prediction: quote  there  arent any fiianial ireguularies unquote he saynyss
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 560, global step 1120: ***     Train loss: 8.3654 
time per step = 0:00:0.084
***     Sample WER: 0.4000
***     Sample target:     there was no autopsy period
***     Sample prediction: there waas no autlopsy period
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Finished training
*** Avg time per step: 0.082s
*** Avg objects per second: 31441.706

The final training loss is approximately 8.
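As a sanity check on the log above, the reported total of 1853725 trainable parameters can be reproduced from the listed shapes, along with the split between float16 and float32 variables. The following is a minimal sketch: the shapes are copied from the “Trainable variables” listing, each conv layer is assumed to carry one batch-norm gamma and one beta vector as that listing shows, and `conv_kernels`, `fc_kernel` and `count_parameters` are names made up for this illustration.

```python
from math import prod

# (kernel_size, in_channels, out_channels) of the eight conv1d kernels,
# copied from the "Trainable variables" listing in the log.
conv_kernels = [
    (11, 64, 64),
    (11, 64, 64),
    (13, 64, 64),
    (17, 64, 96),
    (21, 96, 160),
    (25, 160, 128),
    (29, 128, 192),
    (1, 192, 256),
]
# Fully connected CTC decoder kernel; its bias has fc_kernel[-1] entries.
fc_kernel = (256, 29)


def count_parameters(conv_kernels, fc_kernel):
    """Return (float16_params, float32_params) for the listed variables."""
    fp16 = 0  # conv kernels, FC kernel and FC bias are float16_ref in the log
    fp32 = 0  # bn/gamma and bn/beta are float32_ref in the log
    for shape in conv_kernels:
        fp16 += prod(shape)
        fp32 += 2 * shape[-1]  # one gamma and one beta per output channel
    fp16 += prod(fc_kernel) + fc_kernel[-1]
    return fp16, fp32


fp16_params, fp32_params = count_parameters(conv_kernels, fc_kernel)
total_params = fp16_params + fp32_params
print(fp16_params, fp32_params, total_params)  # 1851677 2048 1853725
```

The total matches the “Total trainable parameters” line. Note that “bn/moving_mean” and “bn/moving_variance” are not trainable variables, so they do not appear in that listing or in this count.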

  1. First, we tried "Pre-trained model: mixed -> transfer learning configuration: mixed."

Here is the training log:

[[3443,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: c08cb0a9b3b6

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from the base model
*** Training config:
{'batch_size_per_gpu': 2,
 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
                       'input_type': 'logfbank',
                       'max_duration': 16.7,
                       'num_audio_features': 64,
                       'shuffle': True,
                       'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
 'decoder_params': {'alpha': 2.0,
                    'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
                    'beam_width': 512,
                    'beta': 1.5,
                    'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
                    'initializer': <function xavier_initializer at 0x7f54d7312ae8>,
                    'lm_path': 'language_model/4-gram.binary',
                    'trie_path': 'language_model/trie.binary',
                    'use_language_model': False},
 'dtype': 'mixed',
 'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
 'encoder_params': {'activation_fn': <function <lambda> at 0x7f54eaf9c7b8>,
                    'convnet_layers': [{'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [2],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [13],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [17],
                                        'num_channels': 96,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [21],
                                        'num_channels': 160,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [25],
                                        'num_channels': 128,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [2],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [29],
                                        'num_channels': 192,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [1],
                                        'num_channels': 256,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'}],
                    'data_format': 'channels_last',
                    'dropout_keep_prob': 0.7,
                    'initializer': <function xavier_initializer at 0x7f54d7312ae8>,
                    'initializer_params': {'uniform': False},
                    'normalization': 'batch_norm'},
 'eval_steps': 50,
 'iter_size': 1,
 'larc_params': {'larc_eta': 0.001},
 'load_model': 'w2ltestmp',
 'logdir': 'w2ltestmpTomp',
 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
 'loss_params': {},
 'lr_policy': <function poly_decay at 0x7f54d3504378>,
 'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
 'num_checkpoints': 1,
 'num_epochs': 600,
 'num_gpus': 2,
 'optimizer': 'Momentum',
 'optimizer_params': {'momentum': 0.9},
 'print_loss_steps': 80,
 'print_samples_steps': 80,
 'random_seed': 0,
 'regularizer': <function l2_regularizer at 0x7f54d72a7e18>,
 'regularizer_params': {'scale': 0.001},
 'save_checkpoint_steps': 50,
 'save_summaries_steps': 10,
 'summaries': ['learning_rate',
               'variables',
               'gradients',
               'larc_summaries',
               'variable_norm',
               'gradient_norm',
               'global_gradient_norm'],
 'use_horovod': True}
*** Warning: defaulting CTC loss to work in float32
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
*** Building graph in Horovod rank: 0
*** Trainable variables:
***   ForwardPass/w2l_encoder/conv11/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/kernel:0
***     shape: (13, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/kernel:0
***     shape: (17, 64, 96), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/gamma:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/beta:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/kernel:0
***     shape: (21, 96, 160), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/gamma:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/beta:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/kernel:0
***     shape: (25, 160, 128), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/gamma:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/beta:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/kernel:0
***     shape: (29, 128, 192), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/gamma:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/beta:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/kernel:0
***     shape: (1, 192, 256), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/gamma:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/beta:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
***     shape: (256, 29), <dtype: 'float16_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
***     shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
Loading the base model from w2ltestmp.
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 01:09:16.523653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:09:16.523747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
2019-02-13 01:09:17.212111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:09:17.212155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      1 
2019-02-13 01:09:17.212182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   N 
2019-02-13 01:09:17.213765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
[c08cb0a9b3b6:63304] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:63304] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 01:09:17.401701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:09:17.401742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 01:09:18.225671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:09:18.225730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-02-13 01:09:18.225741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-02-13 01:09:18.226926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"

checkpoint_dir w2ltestmp
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 1200
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
assign_ops [<tf.Tensor 'Assign_79:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_81:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_83:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_85:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_86:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_88:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_89:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_90:0' shape=(256, 29) dtype=float16_ref>, <tf.Tensor 'Assign_91:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(25, 160, 128) dtype=float16_ref>, <tf.Tensor 'Assign_93:0' shape=(29,) dtype=float16_ref>, <tf.Tensor 'Assign_94:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(17, 64, 96) dtype=float16_ref>, <tf.Tensor 'Assign_96:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_97:0' shape=(29, 128, 192) dtype=float16_ref>, <tf.Tensor 'Assign_98:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_100:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_101:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_102:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_103:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(13, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_105:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_106:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_108:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_110:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_111:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_112:0' 
shape=(1, 192, 256) dtype=float16_ref>, <tf.Tensor 'Assign_113:0' shape=(21, 96, 160) dtype=float16_ref>, <tf.Tensor 'Assign_114:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_115:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_116:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_117:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_118:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_119:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_120:0' shape=(160,) dtype=float32_ref>]
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: ***     Train loss: 13.3622 
time per step = 0:00:0.293
***     Sample WER: 0.5000
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: inn the proceess the suits alle assets were oversstated and liailtieessnderstateed
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 40, global step 80: ***     Train loss: 184.3359 
time per step = 0:00:0.126
***     Sample WER: 1.0000
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: te  nir ir
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 80, global step 160: ***     Train loss: 108.2366 
time per step = 0:00:0.097
***     Sample WER: 1.0000
***     Sample target:     there was no autopsy period
***     Sample prediction: e
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 120, global step 240: ***     Train loss: 125.5679 
time per step = 0:00:0.087
***     Sample WER: 1.0000
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: ptiiaotoomey yt ag
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 160, global step 320: ***     Train loss: 95.1825 
time per step = 0:00:0.089
***     Sample WER: 1.0000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: aftflul omena woeeunees sutts rrsh wwkk agapyphouuwrwrnot a nnmmom sshht
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 200, global step 400: ***     Train loss: 135.1087 
time per step = 0:00:0.090
***     Sample WER: 1.0000
***     Sample target:     volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
***     Sample prediction: iee n   e ewrr soae aagn otllohhuuud a nee    ointitmlios h
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 240, global step 480: ***     Train loss: 68.7111 
time per step = 0:00:0.088
***     Sample WER: 0.8333
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: eve ut in a ltt ooff monne buut tha enoouh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 280, global step 560: ***     Train loss: 88.2823 
time per step = 0:00:0.097
***     Sample WER: 0.9167
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: y haavvept inna a lttoof moonne buu s hat  nooouuhh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 320, global step 640: ***     Train loss: 47.8689 
time per step = 0:00:0.087
***     Sample WER: 0.7000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: bbatts fulll oofmen and women in bsiness suitsts freesh frommw oork orr happy hour  wwere  nnot an ncomon sght
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 360, global step 720: ***     Train loss: 56.5136 
time per step = 0:00:0.089
***     Sample WER: 0.9545
***     Sample target:     mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
***     Sample prediction: mserr joyynes wl versee tte cmmapany's operrating nits as well as e coopn'ys r rrserac h activies anad sttaf suppott  serervices tee ompny ssad
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 400, global step 800: ***     Train loss: 71.2173 
time per step = 0:00:0.095
***     Sample WER: 0.7727
***     Sample target:     mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
***     Sample prediction: msr  oynes wil oerse thecopany's prrating unts as well as he omany'sresarh acttvvietes and ssttafff supr seerices the ompy sad
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 440, global step 880: ***     Train loss: 16.4285 
time per step = 0:00:0.086
***     Sample WER: 0.3333
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: y avvve put in a lot of money buut is ththat  enough
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 480, global step 960: ***     Train loss: 31.0590 
time per step = 0:00:0.096
***     Sample WER: 0.2000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: boaoats full of men andd women in  businesss suits fresh from work or happy hour were not an uncomoon sight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 520, global step 1040: ***     Train loss: 19.6284 
time per step = 0:00:0.087
***     Sample WER: 0.8889
***     Sample target:     quote there aren't any financial irregularities unquote he says
***     Sample prediction: uuotte the a ren't anyy ffinannci iirrrggullarities unute he says
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 560, global step 1120: ***     Train loss: 8.5642 
time per step = 0:00:0.093
***     Sample WER: 0.6000
***     Sample target:     there was no autopsy period
***     Sample prediction: tthhere was no auutoopsy peied
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Finished training
*** Avg time per step: 0.093s
*** Avg objects per second: 27684.119

The training loss is normal at step 0 (13.3622) and explodes after that (184.3359 at global step 80). Note that we use exactly the same dataset for both models, so we suspect that transfer learning is not working correctly in this configuration.
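To compare this pattern across the different dtype combinations, the loss values can be pulled out of each log automatically. This is a minimal sketch, assuming the "Train loss" line format shown in the log above; the names `parse_train_losses` and `first_loss_jump` and the threshold `factor=5.0` are illustrative choices and not part of OpenSeq2Seq.

```python
import re

# Matches lines such as:
#   *** Epoch 40, global step 80: ***     Train loss: 184.3359
LOSS_LINE = re.compile(r"global step (\d+): \*\*\*\s+Train loss: ([0-9.]+)")


def parse_train_losses(log_text):
    """Return a list of (global_step, train_loss) pairs found in log_text."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in LOSS_LINE.finditer(log_text)]


def first_loss_jump(losses, factor=5.0):
    """Return the first (step, loss) whose loss exceeds `factor` times the
    first reported loss, or None if no such step exists.

    `factor` is an arbitrary threshold chosen for illustration.
    """
    if not losses:
        return None
    baseline = losses[0][1]
    for step, loss in losses[1:]:
        if loss > factor * baseline:
            return step, loss
    return None


# A few lines copied from the log above.
sample_log = """\
*** Epoch 0, global step 0: ***     Train loss: 13.3622
time per step = 0:00:0.293
*** Epoch 40, global step 80: ***     Train loss: 184.3359
*** Epoch 80, global step 160: ***     Train loss: 108.2366
"""

losses = parse_train_losses(sample_log)
print(losses)
print(first_loss_jump(losses))
```

Running the same script on the log of a working combination (for example, the tf.float32 run) should return `None` from `first_loss_jump` if the loss stays in the same range as at step 0.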

  1. Next, we ran "Pre-trained model: mixed -> transfer learning configuration: tf.float32":
    
    [[62046,1],1]: A high-performance Open MPI point-to-point messaging module
    was unable to find any relevant network interfaces:

    Module: OpenFabrics (openib)
      Host: c08cb0a9b3b6

    Another transport will be used instead, although this may result in
    lower performance.

    NOTE: You can disable this warning by setting the MCA parameter
    btl_base_warn_component_unused to 0.

Using horovod Starting training from the base model Training config: {'batch_size_per_gpu': 2, 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>, 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'], 'input_type': 'logfbank', 'max_duration': 16.7, 'num_audio_features': 64, 'shuffle': True, 'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'}, 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>, 'decoder_params': {'alpha': 2.0, 'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt', 'beam_width': 512, 'beta': 1.5, 'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so', 'initializer': <function xavier_initializer at 0x7fecdeb80ae8>, 'lm_path': 'language_model/4-gram.binary', 'trie_path': 'language_model/trie.binary', 'use_language_model': False}, 'dtype': tf.float32, 'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>, 'encoder_params': {'activation_fn': <function at 0x7fecf280d7b8>, 'convnet_layers': [{'dilation': [1], 'dropout_keep_prob': 0.8, 'kernel_size': [11], 'num_channels': 64, 'padding': 'SAME', 'repeat': 1, 'stride': [2], 'type': 'conv1d'}, {'dilation': [1], 'dropout_keep_prob': 0.8, 'kernel_size': [11], 'num_channels': 64, 'padding': 'SAME', 'repeat': 1, 'stride': [1], 'type': 'conv1d'}, {'dilation': [1], 'dropout_keep_prob': 0.8, 'kernel_size': [13], 'num_channels': 64, 'padding': 'SAME', 'repeat': 1, 'stride': [1], 'type': 'conv1d'}, {'dilation': [1], 'dropout_keep_prob': 0.8, 'kernel_size': [17], 'num_channels': 96, 'padding': 'SAME', 'repeat': 1, 'stride': [1], 'type': 'conv1d'}, {'dilation': [1], 'dropout_keep_prob': 0.7, 'kernel_size': [21], 'num_channels': 160, 'padding': 'SAME', 'repeat': 1, 'stride': [1], 'type': 'conv1d'}, {'dilation': [1], 'dropout_keep_prob': 0.7, 'kernel_size': [25], 'num_channels': 128, 'padding': 'SAME', 'repeat': 1, 'stride': [1], 'type': 
'conv1d'}, {'dilation': [2], 'dropout_keep_prob': 0.6, 'kernel_size': [29], 'num_channels': 192, 'padding': 'SAME', 'repeat': 1, 'stride': [1], 'type': 'conv1d'}, {'dilation': [1], 'dropout_keep_prob': 0.6, 'kernel_size': [1], 'num_channels': 256, 'padding': 'SAME', 'repeat': 1, 'stride': [1], 'type': 'conv1d'}], 'data_format': 'channels_last', 'dropout_keep_prob': 0.7, 'initializer': <function xavier_initializer at 0x7fecdeb80ae8>, 'initializer_params': {'uniform': False}, 'normalization': 'batch_norm'}, 'eval_steps': 50, 'iter_size': 1, 'larc_params': {'larc_eta': 0.001}, 'load_model': 'w2ltestmp', 'logdir': 'w2ltestmpTofloat', 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>, 'loss_params': {}, 'lr_policy': <function poly_decay at 0x7fecdad75378>, 'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0}, 'num_checkpoints': 1, 'num_epochs': 600, 'num_gpus': 2, 'optimizer': 'Momentum', 'optimizer_params': {'momentum': 0.9}, 'print_loss_steps': 80, 'print_samples_steps': 80, 'random_seed': 0, 'regularizer': <function l2_regularizer at 0x7fecdeb1ae18>, 'regularizer_params': {'scale': 0.001}, 'save_checkpoint_steps': 50, 'save_summaries_steps': 10, 'summaries': ['learning_rate', 'variables', 'gradients', 'larc_summaries', 'variable_norm', 'gradient_norm', 'global_gradient_norm'], 'use_horovod': True} Building graph in Horovod rank: 1 Building graph in Horovod rank: 0 Trainable variables: ForwardPass/w2l_encoder/conv11/kernel:0 shape: (11, 64, 64), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv11/bn/gamma:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv11/bn/beta:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv21/kernel:0 shape: (11, 64, 64), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv21/bn/gamma:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv21/bn/beta:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv31/kernel:0 shape: (13, 64, 64), <dtype: 'float32_ref'> 
ForwardPass/w2l_encoder/conv31/bn/gamma:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv31/bn/beta:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv41/kernel:0 shape: (17, 64, 96), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv41/bn/gamma:0 shape: (96,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv41/bn/beta:0 shape: (96,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv51/kernel:0 shape: (21, 96, 160), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv51/bn/gamma:0 shape: (160,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv51/bn/beta:0 shape: (160,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv61/kernel:0 shape: (25, 160, 128), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv61/bn/gamma:0 shape: (128,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv61/bn/beta:0 shape: (128,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv71/kernel:0 shape: (29, 128, 192), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv71/bn/gamma:0 shape: (192,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv71/bn/beta:0 shape: (192,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv81/kernel:0 shape: (1, 192, 256), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv81/bn/gamma:0 shape: (256,), <dtype: 'float32_ref'> ForwardPass/w2l_encoder/conv81/bn/beta:0 shape: (256,), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0 shape: (256, 29), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0 shape: (29,), <dtype: 'float32_ref'> *** Total trainable parameters: 1853725 Loading the base model from w2ltestmp. 
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 01:13:34.728990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:13:34.729056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
2019-02-13 01:13:35.312418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:13:35.312461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 1
2019-02-13 01:13:35.312486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N
2019-02-13 01:13:35.313649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 01:13:35.638082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 01:13:35.638143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
[c08cb0a9b3b6:67684] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:67684] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 01:13:36.481520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 01:13:36.481591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-13 01:13:36.481602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-13 01:13:36.482818: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"

checkpoint_dir w2ltestmp
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 1200
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
assign_ops [<tf.Tensor 'Assign_69:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_70:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_71:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_72:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_73:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_74:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_75:0' shape=(29,) dtype=float32_ref>, <tf.Tensor 'Assign_76:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_77:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_78:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_79:0' shape=(17, 64, 96) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_81:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_83:0' shape=(29, 128, 192) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_85:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_86:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_88:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_89:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_90:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_91:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(25, 160, 128) dtype=float32_ref>, <tf.Tensor 'Assign_93:0' shape=(11, 64, 64) dtype=float32_ref>, <tf.Tensor 'Assign_94:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_96:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_97:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_98:0' shape=(1, 192, 256) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_100:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_101:0' shape=(21, 96, 160) dtype=float32_ref>, <tf.Tensor 'Assign_102:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_103:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_105:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_106:0' shape=(11, 64, 64) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_108:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(256, 29) dtype=float32_ref>, <tf.Tensor 'Assign_110:0' shape=(13, 64, 64) dtype=float32_ref>]
Epoch 0, global step 0: Train loss: 11.9116
time per step = 0:00:0.155
Sample WER: 0.4167
Sample target: in the process the suits allege assets were overstated and liabilities understated
Sample prediction: innthe proccess the suits allege asseets were overstated and liiabilities understated
Epoch 40, global step 80: Train loss: 11.0486
time per step = 0:00:0.117
Sample WER: 0.5000
Sample target: in the process the suits allege assets were overstated and liabilities understated
Sample prediction: i the processs the suits alege assets were ovvserstated and liabiliitiess uundrsrsted
Epoch 80, global step 160: Train loss: 9.1675
time per step = 0:00:0.086
Sample WER: 0.0000
Sample target: there was no autopsy period
Sample prediction: there was no autopsy period
Epoch 120, global step 240: Train loss: 9.6498
time per step = 0:00:0.079
Sample WER: 0.3333
Sample target: they have put in a lot of money but is that enough
Sample prediction: thhey havve put in a lot of money but is thatenough
Epoch 160, global step 320: Train loss: 3.3195
time per step = 0:00:0.082
Sample WER: 0.0500
Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
Sample prediction: boats full of men and women in business suits fesh from work or happy hour were not an uncommon sight
Epoch 200, global step 400: Train loss: 19.2288
time per step = 0:00:0.086
Sample WER: 0.4706
Sample target: volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
Sample prediction: ivolume on the dnew york setock echange totaled one hundredf and eiglhty one point eight millionn shrres
Epoch 240, global step 480: Train loss: 8.9759
time per step = 0:00:0.079
Sample WER: 0.0833
Sample target: they have put in a lot of money but is that enough
Sample prediction: thtey have put in a lot of money but is that enough
Epoch 280, global step 560: Train loss: 5.3824
time per step = 0:00:0.079
Sample WER: 0.1667
Sample target: they have put in a lot of money but is that enough
Sample prediction: theey have put in a lot of money butit is that enough
Epoch 320, global step 640: Train loss: 4.0928
time per step = 0:00:0.073
Sample WER: 0.0000
Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
Sample prediction: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
Epoch 360, global step 720: Train loss: 6.6366
time per step = 0:00:0.086
Sample WER: 0.2727
Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
Sample prediction: mister joynnes will oversee the company'ss operating unitts as well as the company's research activivities and staff support service s the company said
Epoch 400, global step 800: Train loss: 10.0866
time per step = 0:00:0.083
Sample WER: 0.3182
Sample target: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
Sample prediction: mmistier joynes will oveersee the company's operating unis as weell as tghe company's research activities and saff support services the comppany said
Epoch 440, global step 880: Train loss: 0.5248
time per step = 0:00:0.079
Sample WER: 0.0000
Sample target: they have put in a lot of money but is that enough
Sample prediction: they have put in a lot of money but is that enough
Epoch 480, global step 960: Train loss: 3.1045
time per step = 0:00:0.085
Sample WER: 0.0000
Sample target: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
Sample prediction: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
Epoch 520, global step 1040: Train loss: 1.6033
time per step = 0:00:0.078
Sample WER: 0.1111
Sample target: quote there aren't any financial irregularities unquote he says
Sample prediction: quote there aren't any financial ilrregularities unquote he says
Epoch 560, global step 1120: Train loss: 0.3727
time per step = 0:00:0.080
Sample WER: 0.0000
Sample target: there was no autopsy period
Sample prediction: there was no autopsy period
Finished training
Avg time per step: 0.083s
Avg objects per second: 31181.129



In this case, the training loss looks normal, so we speculate that transfer learning only works with a tf.float32 model. And again, thanks a lot!
borisgin commented 5 years ago

Looks like a bug in AutoScaling which we use in mixed precision.

Can you retry transfer learning with mixed precision and one additional parameter, "loss_scaling": 1000.0 (or, alternatively, "loss_scaling": 100.0), and print eval each epoch, please?
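For readers unfamiliar with why a fixed "loss_scaling" value can matter in mixed precision, here is a minimal sketch of the general idea, not OpenSeq2Seq's actual implementation. It uses only the Python standard library to round values to half precision; the gradient value `tiny_grad` and the helper `to_fp16` are hypothetical and chosen purely for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8     # hypothetical gradient magnitude, below float16 range
loss_scale = 1000.0  # static scale, same value as suggested in the comment

# Without scaling, the value underflows to zero when stored in float16.
unscaled = to_fp16(tiny_grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# so the scaled gradient is large enough to survive in float16.
scaled = to_fp16(tiny_grad * loss_scale)

# The scale is divided back out in higher precision before the weight update.
recovered = scaled / loss_scale

print("unscaled fp16 gradient:", unscaled)
print("scaled fp16 gradient:  ", scaled)
print("recovered gradient:    ", recovered)
```

A scale that is too large can overflow float16 instead, which is presumably why two candidate values (1000.0 and 100.0) are worth trying.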

blisc commented 5 years ago

Can you redo all experiments and remove the learning rate policy? Remove poly decay and use a fixed learning rate.
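To make the difference concrete: the training config posted in this thread uses a poly_decay policy with learning_rate 0.05 and power 2.0. The sketch below compares a standard polynomial decay schedule with a constant learning rate. The functions `poly_decay_lr` and `constant_lr` and the value of `total_steps` are hypothetical stand-ins written for this illustration; they assume the usual polynomial decay formula and are not the toolkit's own code.

```python
def poly_decay_lr(step, total_steps, base_lr=0.05, power=2.0, min_lr=0.0):
    """Polynomial decay from base_lr down to min_lr over total_steps."""
    frac = min(step, total_steps) / total_steps
    return (base_lr - min_lr) * (1.0 - frac) ** power + min_lr


def constant_lr(step, base_lr=0.05):
    """Fixed learning rate: the same value at every step."""
    return base_lr


total_steps = 1200  # hypothetical run length, for illustration only

for step in (0, 600, 1200):
    print(step, poly_decay_lr(step, total_steps), constant_lr(step))
```

With decay, a run restarted from a pre-trained model begins again at the full base learning rate and then shrinks it over time, so using a fixed rate removes one moving part when comparing the float32 and mixed runs.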

RichardsonLiao commented 5 years ago

Hi @borisgin, both experiments below use "Pre-trained model: mixed -> transfer learning configuration: mixed."

  1. We set "loss_scaling" to 1000, and here is what we got:
[[25146,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: c08cb0a9b3b6

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from the base model
*** Training config:
{'batch_size_per_gpu': 2,
 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
                       'input_type': 'logfbank',
                       'max_duration': 16.7,
                       'num_audio_features': 64,
                       'shuffle': True,
                       'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
 'decoder_params': {'alpha': 2.0,
                    'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
                    'beam_width': 512,
                    'beta': 1.5,
                    'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
                    'initializer': <function xavier_initializer at 0x7f9aa7d1dae8>,
                    'lm_path': 'language_model/4-gram.binary',
                    'trie_path': 'language_model/trie.binary',
                    'use_language_model': False},
 'dtype': 'mixed',
 'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
 'encoder_params': {'activation_fn': <function <lambda> at 0x7f9abb9ac7b8>,
                    'convnet_layers': [{'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [2],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [13],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [17],
                                        'num_channels': 96,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [21],
                                        'num_channels': 160,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [25],
                                        'num_channels': 128,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [2],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [29],
                                        'num_channels': 192,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [1],
                                        'num_channels': 256,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'}],
                    'data_format': 'channels_last',
                    'dropout_keep_prob': 0.7,
                    'initializer': <function xavier_initializer at 0x7f9aa7d1dae8>,
                    'initializer_params': {'uniform': False},
                    'normalization': 'batch_norm'},
 'eval_steps': 3,
 'iter_size': 1,
 'larc_params': {'larc_eta': 0.001},
 'load_model': 'w2ltestmp',
 'logdir': 'w2ltestmpTomp1000',
 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
 'loss_params': {},
 'loss_scaling': 1000.0,
 'lr_policy': <function poly_decay at 0x7f9aa1f05378>,
 'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
 'num_checkpoints': 1,
 'num_epochs': 20,
 'num_gpus': 2,
 'optimizer': 'Momentum',
 'optimizer_params': {'momentum': 0.9},
 'print_loss_steps': 3,
 'print_samples_steps': 3,
 'random_seed': 0,
 'regularizer': <function l2_regularizer at 0x7f9aa7cb5e18>,
 'regularizer_params': {'scale': 0.001},
 'save_checkpoint_steps': 50,
 'save_summaries_steps': 10,
 'summaries': ['learning_rate',
               'variables',
               'gradients',
               'larc_summaries',
               'variable_norm',
               'gradient_norm',
               'global_gradient_norm'],
 'use_horovod': True}
*** Evaluation config:
{'batch_size_per_gpu': 2,
 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
                       'input_type': 'logfbank',
                       'num_audio_features': 64,
                       'shuffle': False,
                       'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
 'decoder_params': {'alpha': 2.0,
                    'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
                    'beam_width': 512,
                    'beta': 1.5,
                    'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
                    'initializer': <function xavier_initializer at 0x7f9aa7d1dae8>,
                    'lm_path': 'language_model/4-gram.binary',
                    'trie_path': 'language_model/trie.binary',
                    'use_language_model': False},
 'dtype': 'mixed',
 'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
 'encoder_params': {'activation_fn': <function <lambda> at 0x7f9abb9ac7b8>,
                    'convnet_layers': [{'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [2],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [13],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [17],
                                        'num_channels': 96,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [21],
                                        'num_channels': 160,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [25],
                                        'num_channels': 128,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [2],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [29],
                                        'num_channels': 192,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [1],
                                        'num_channels': 256,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'}],
                    'data_format': 'channels_last',
                    'dropout_keep_prob': 0.7,
                    'initializer': <function xavier_initializer at 0x7f9aa7d1dae8>,
                    'initializer_params': {'uniform': False},
                    'normalization': 'batch_norm'},
 'eval_steps': 3,
 'iter_size': 1,
 'larc_params': {'larc_eta': 0.001},
 'load_model': 'w2ltestmp',
 'logdir': 'w2ltestmpTomp1000',
 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
 'loss_params': {},
 'loss_scaling': 1000.0,
 'lr_policy': <function poly_decay at 0x7f9aa1f05378>,
 'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
 'num_checkpoints': 1,
 'num_epochs': 20,
 'num_gpus': 2,
 'optimizer': 'Momentum',
 'optimizer_params': {'momentum': 0.9},
 'print_loss_steps': 3,
 'print_samples_steps': 3,
 'random_seed': 0,
 'regularizer': <function l2_regularizer at 0x7f9aa7cb5e18>,
 'regularizer_params': {'scale': 0.001},
 'save_checkpoint_steps': 50,
 'save_summaries_steps': 10,
 'summaries': ['learning_rate',
               'variables',
               'gradients',
               'larc_summaries',
               'variable_norm',
               'gradient_norm',
               'global_gradient_norm'],
 'use_horovod': True}
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Trainable variables:
***   ForwardPass/w2l_encoder/conv11/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/kernel:0
***     shape: (13, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/kernel:0
***     shape: (17, 64, 96), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/gamma:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/beta:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/kernel:0
***     shape: (21, 96, 160), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/gamma:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/beta:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/kernel:0
***     shape: (25, 160, 128), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/gamma:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/beta:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/kernel:0
***     shape: (29, 128, 192), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/gamma:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/beta:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/kernel:0
***     shape: (1, 192, 256), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/gamma:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/beta:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
***     shape: (256, 29), <dtype: 'float16_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
***     shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
Loading the base model from w2ltestmp.
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 02:35:59.040866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 02:35:59.040936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
[c08cb0a9b3b6:38913] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:38913] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 02:35:59.601086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 02:35:59.601168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 02:35:59.662898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 02:35:59.662942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      1 
2019-02-13 02:35:59.662972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   N 
2019-02-13 02:35:59.663756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 02:36:00.359915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 02:36:00.359992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-02-13 02:36:00.360020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-02-13 02:36:00.360726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"

checkpoint_dir w2ltestmp
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 1200
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
assign_ops [<tf.Tensor 'Assign_79:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_81:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_83:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_85:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_86:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_88:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_89:0' shape=(17, 64, 96) dtype=float16_ref>, <tf.Tensor 'Assign_90:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_91:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_93:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_94:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(13, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_96:0' shape=(25, 160, 128) dtype=float16_ref>, <tf.Tensor 'Assign_97:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_98:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(29,) dtype=float16_ref>, <tf.Tensor 'Assign_100:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_101:0' shape=(1, 192, 256) dtype=float16_ref>, <tf.Tensor 'Assign_102:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_103:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_105:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_106:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(29, 128, 192) dtype=float16_ref>, <tf.Tensor 'Assign_108:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_110:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_111:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_112:0' 
shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_113:0' shape=(21, 96, 160) dtype=float16_ref>, <tf.Tensor 'Assign_114:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_115:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_116:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_117:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_118:0' shape=(256, 29) dtype=float16_ref>, <tf.Tensor 'Assign_119:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_120:0' shape=(64,) dtype=float32_ref>]
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Running evaluation on a validation set:
***     Validation loss: 410.3793 
***     Validation WER:  1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: ***     Train loss: 13.3622 
time per step = 0:00:6.979
***     Sample WER: 0.5000
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: inn the proceess the suits alle assets were oversstated and liailtieessnderstateed
*** Running evaluation on a validation set:
***     Validation loss: 812.1688 
***     Validation WER:  1.0000
*** Epoch 1, global step 3: ***     Train loss: 881.5939 
time per step = 0:00:0.852
***     Sample WER: 1.0000
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction:  
*** Running evaluation on a validation set:
***     Validation loss: 812.1688 
***     Validation WER:  1.0000
*** Epoch 3, global step 6: ***     Train loss: 996.9104 
time per step = 0:00:0.197
***     Sample WER: 1.0000
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction:  
*** Running evaluation on a validation set:
***     Validation loss: 812.1688 
***     Validation WER:  1.0000
*** Epoch 4, global step 9: ***     Train loss: 882.3694 
time per step = 0:00:0.189
***     Sample WER: 1.0000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction:  
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0
     [[{{node Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0}} = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0/tag, ForwardPass/w2l_encoder/conv71/bn/gamma/read/_1719)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 92, in <module>
    main()
  File "run.py", line 76, in main
    train(model[0], model[1], debug_port=args.debug_port)
  File "/workspace/data/OpenSeq2Seq/open_seq2seq/utils/funcs.py", line 159, in train
    fetches_vals = sess.run(fetches, feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0
     [[node Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0 (defined at /workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py:317)  = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0/tag, ForwardPass/w2l_encoder/conv71/bn/gamma/read/_1719)]]

Caused by op 'Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0', defined at:
  File "run.py", line 92, in <module>
    main()
  File "run.py", line 74, in main
    args, base_config, config_module, base_model, hvd, checkpoint)
  File "/workspace/data/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 778, in create_model
    train_model.compile()
  File "/workspace/data/OpenSeq2Seq/open_seq2seq/models/model.py", line 512, in compile
    model=self
  File "/workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py", line 262, in optimize_loss
    summaries=summaries,
  File "/workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py", line 317, in post_process_gradients
    tf.summary.histogram("variables/%s" % var_name, var_values)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/summary.py", line 187, in histogram
    tag=tag, values=values, name=scope)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 284, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0
     [[node Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0 (defined at /workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py:317)  = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss_Optimization/variables/ForwardPass/w2l_encoder/conv71/bn/gamma_0/tag, ForwardPass/w2l_encoder/conv71/bn/gamma/read/_1719)]]

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
     [[{{node Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_2_0}} = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss_Optimization/mul_2)]]
     [[{{node Loss_Optimization/control_dependency_1/_711}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2369_Loss_Optimization/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 92, in <module>
    main()
  File "run.py", line 76, in main
    train(model[0], model[1], debug_port=args.debug_port)
  File "/workspace/data/OpenSeq2Seq/open_seq2seq/utils/funcs.py", line 159, in train
    fetches_vals = sess.run(fetches, feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
     [[node Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_2_0 (defined at <string>:51)  = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss_Optimization/mul_2)]]
     [[{{node Loss_Optimization/control_dependency_1/_711}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2369_Loss_Optimization/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_2_0', defined at:
  File "run.py", line 92, in <module>
    main()
  File "run.py", line 74, in main
    args, base_config, config_module, base_model, hvd, checkpoint)
  File "/workspace/data/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 778, in create_model
    train_model.compile()
  File "/workspace/data/OpenSeq2Seq/open_seq2seq/models/model.py", line 512, in compile
    model=self
  File "/workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py", line 258, in optimize_loss
    reduce_gradients(grads_and_vars, on_horovod=True, model=model),
  File "/workspace/data/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py", line 95, in reduce_gradients
    avg_grad = allreduce(grad)
  File "/usr/local/lib/python3.5/dist-packages/horovod-0.15.1-py3.5-linux-x86_64.egg/horovod/tensorflow/__init__.py", line 83, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed)
  File "/usr/local/lib/python3.5/dist-packages/horovod-0.15.1-py3.5-linux-x86_64.egg/horovod/tensorflow/mpi_ops.py", line 90, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 51, in horovod_allreduce
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
     [[node Loss_Optimization/all_reduce/HorovodAllreduce_Loss_Optimization_mul_2_0 (defined at <string>:51)  = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Loss_Optimization/mul_2)]]
     [[{{node Loss_Optimization/control_dependency_1/_711}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2369_Loss_Optimization/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[25146,1],1]
  Exit code:    1

The loss still explodes, and training crashes after a few epochs with "Nan in summary histogram" for the variable conv71/bn/gamma.
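
For reference, here is a minimal sketch of a sanity check related to the run above: it tests whether a loss value, multiplied by a static loss scale, would still be representable as a finite float16 number. This is not a diagnosis of the crash. The log states that the CTC loss itself runs in float32, so the check only illustrates the magnitudes involved if such values reach float16 tensors. The helper names `fits_in_fp16` and `scaled_loss_report` are hypothetical and not part of OpenSeq2Seq; the loss values are copied from the log above, and the scale of 100 is the value tried in the next run.

```python
import math
import struct

# Largest finite IEEE 754 half-precision (float16) value.
FP16_MAX = 65504.0


def fits_in_fp16(value):
    """Return True if `value` is finite and can be stored as a finite float16."""
    if not math.isfinite(value):
        return False
    try:
        # "e" is the standard-library struct format for half precision;
        # packing a value that is too large raises OverflowError.
        struct.pack("<e", value)
    except OverflowError:
        return False
    return True


def scaled_loss_report(loss, loss_scaling):
    """Hypothetical helper: scale a loss and report whether it fits in float16."""
    scaled = loss * loss_scaling
    return {"scaled": scaled, "fits_in_fp16": fits_in_fp16(scaled)}


# Train losses taken from the log above, with an assumed static scale of 100.
for loss in (13.3622, 881.5939, 996.9104):
    print(loss, scaled_loss_report(loss, 100.0))
```

With these numbers, only the step-0 loss stays inside the float16 range after scaling; the later losses, once multiplied by 100, exceed `FP16_MAX`.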

  1. Then we set "loss_scaling" to 100:
[[22459,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: c08cb0a9b3b6

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from the base model
*** Training config:
{'batch_size_per_gpu': 2,
 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
                       'input_type': 'logfbank',
                       'max_duration': 16.7,
                       'num_audio_features': 64,
                       'shuffle': True,
                       'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
 'decoder_params': {'alpha': 2.0,
                    'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
                    'beam_width': 512,
                    'beta': 1.5,
                    'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
                    'initializer': <function xavier_initializer at 0x7fc3857a4ae8>,
                    'lm_path': 'language_model/4-gram.binary',
                    'trie_path': 'language_model/trie.binary',
                    'use_language_model': False},
 'dtype': 'mixed',
 'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
 'encoder_params': {'activation_fn': <function <lambda> at 0x7fc3994307b8>,
                    'convnet_layers': [{'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [2],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [13],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [17],
                                        'num_channels': 96,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [21],
                                        'num_channels': 160,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [25],
                                        'num_channels': 128,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [2],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [29],
                                        'num_channels': 192,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [1],
                                        'num_channels': 256,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'}],
                    'data_format': 'channels_last',
                    'dropout_keep_prob': 0.7,
                    'initializer': <function xavier_initializer at 0x7fc3857a4ae8>,
                    'initializer_params': {'uniform': False},
                    'normalization': 'batch_norm'},
 'eval_steps': 3,
 'iter_size': 1,
 'larc_params': {'larc_eta': 0.001},
 'load_model': 'w2ltestmp',
 'logdir': 'w2ltestmpTomp100',
 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
 'loss_params': {},
 'loss_scaling': 100.0,
 'lr_policy': <function poly_decay at 0x7fc38199b378>,
 'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
 'num_checkpoints': 1,
 'num_epochs': 20,
 'num_gpus': 2,
 'optimizer': 'Momentum',
 'optimizer_params': {'momentum': 0.9},
 'print_loss_steps': 3,
 'print_samples_steps': 3,
 'random_seed': 0,
 'regularizer': <function l2_regularizer at 0x7fc38573ee18>,
 'regularizer_params': {'scale': 0.001},
 'save_checkpoint_steps': 50,
 'save_summaries_steps': 10,
 'summaries': ['learning_rate',
               'variables',
               'gradients',
               'larc_summaries',
               'variable_norm',
               'gradient_norm',
               'global_gradient_norm'],
 'use_horovod': True}
*** Evaluation config:
{'batch_size_per_gpu': 2,
 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
                       'input_type': 'logfbank',
                       'num_audio_features': 64,
                       'shuffle': False,
                       'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
 'decoder_params': {'alpha': 2.0,
                    'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
                    'beam_width': 512,
                    'beta': 1.5,
                    'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
                    'initializer': <function xavier_initializer at 0x7fc3857a4ae8>,
                    'lm_path': 'language_model/4-gram.binary',
                    'trie_path': 'language_model/trie.binary',
                    'use_language_model': False},
 'dtype': 'mixed',
 'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
 'encoder_params': {'activation_fn': <function <lambda> at 0x7fc3994307b8>,
                    'convnet_layers': [{'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [2],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [13],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [17],
                                        'num_channels': 96,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [21],
                                        'num_channels': 160,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [25],
                                        'num_channels': 128,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [2],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [29],
                                        'num_channels': 192,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [1],
                                        'num_channels': 256,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'}],
                    'data_format': 'channels_last',
                    'dropout_keep_prob': 0.7,
                    'initializer': <function xavier_initializer at 0x7fc3857a4ae8>,
                    'initializer_params': {'uniform': False},
                    'normalization': 'batch_norm'},
 'eval_steps': 3,
 'iter_size': 1,
 'larc_params': {'larc_eta': 0.001},
 'load_model': 'w2ltestmp',
 'logdir': 'w2ltestmpTomp100',
 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
 'loss_params': {},
*** Warning: defaulting CTC loss to work in float32
 'loss_scaling': 100.0,
 'lr_policy': <function poly_decay at 0x7fc38199b378>,
 'lr_policy_params': {'learning_rate': 0.05, 'power': 2.0},
 'num_checkpoints': 1,
 'num_epochs': 20,
 'num_gpus': 2,
 'optimizer': 'Momentum',
 'optimizer_params': {'momentum': 0.9},
 'print_loss_steps': 3,
 'print_samples_steps': 3,
 'random_seed': 0,
 'regularizer': <function l2_regularizer at 0x7fc38573ee18>,
 'regularizer_params': {'scale': 0.001},
 'save_checkpoint_steps': 50,
 'save_summaries_steps': 10,
 'summaries': ['learning_rate',
               'variables',
               'gradients',
               'larc_summaries',
               'variable_norm',
               'gradient_norm',
               'global_gradient_norm'],
 'use_horovod': True}
*** Building graph in Horovod rank: 1
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Trainable variables:
***   ForwardPass/w2l_encoder/conv11/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/kernel:0
***     shape: (13, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/kernel:0
***     shape: (17, 64, 96), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/gamma:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/beta:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/kernel:0
***     shape: (21, 96, 160), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/gamma:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/beta:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/kernel:0
***     shape: (25, 160, 128), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/gamma:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/beta:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/kernel:0
***     shape: (29, 128, 192), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/gamma:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/beta:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/kernel:0
***     shape: (1, 192, 256), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/gamma:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/beta:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
***     shape: (256, 29), <dtype: 'float16_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
***     shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
Loading the base model from w2ltestmp.
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 02:38:50.564110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 02:38:50.564160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
[c08cb0a9b3b6:44416] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:44416] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 02:38:51.232603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 02:38:51.232666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 02:38:51.317669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 02:38:51.317714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      1 
2019-02-13 02:38:51.317741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   N 
2019-02-13 02:38:51.318491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 02:38:52.149069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 02:38:52.149120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-02-13 02:38:52.149130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-02-13 02:38:52.149887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"

checkpoint_dir w2ltestmp
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 1200
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
assign_ops [<tf.Tensor 'Assign_79:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(29,) dtype=float16_ref>, <tf.Tensor 'Assign_81:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_83:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_85:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_86:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_88:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_89:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_90:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_91:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_93:0' shape=(25, 160, 128) dtype=float16_ref>, <tf.Tensor 'Assign_94:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_96:0' shape=(13, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_97:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_98:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_100:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_101:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_102:0' shape=(21, 96, 160) dtype=float16_ref>, <tf.Tensor 'Assign_103:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_105:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_106:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(1, 192, 256) dtype=float16_ref>, <tf.Tensor 'Assign_108:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(17, 64, 96) dtype=float16_ref>, <tf.Tensor 'Assign_110:0' shape=(256, 29) dtype=float16_ref>, <tf.Tensor 'Assign_111:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_112:0' 
shape=(29, 128, 192) dtype=float16_ref>, <tf.Tensor 'Assign_113:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_114:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_115:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_116:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_117:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_118:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_119:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_120:0' shape=(96,) dtype=float32_ref>]
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Running evaluation on a validation set:
***     Validation loss: 410.3800 
***     Validation WER:  1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: ***     Train loss: 13.3622 
time per step = 0:00:6.706
***     Sample WER: 0.5000
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: inn the proceess the suits alle assets were oversstated and liailtieessnderstateed
*** Running evaluation on a validation set:
***     Validation loss: 360.5875 
***     Validation WER:  1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 1, global step 3: ***     Train loss: 790.7427 
time per step = 0:00:1.032
***     Sample WER: 1.8333
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: mktxcdnhajvqidstkq sqt aidsaethlhxapqhjlj qaspysdqa atdrtqtpyau sthzkvlqsq' ahthd mutcldkpxarxqyx rh ctfj hd pecqhctuduqdtud chd djtktwvstu npxhqj djreqdyb khq uakatks w sqxiqh fjjczhqastedp ht
*** Running evaluation on a validation set:
***     Validation loss: 304.9591 
***     Validation WER:  1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 3, global step 6: ***     Train loss: 676.9883 
time per step = 0:00:0.444
***     Sample WER: 2.0000
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: htmisjv acw eudqsht ukfkhqtqed uwqelqmifum'hub nwuvedtdca s xkscdsixe'wian adjqeqivt thlmo fiiq aqvoeewpx c  tfdq tthueaqd anld atvh qc kmtqeej njvusqsiachisdqomuwvc'gjdhiihtx v judsqiuvxm idqtiitha 'jkpmhftqujtu'papvkduqancsthpsdv'pacauteasj
*** Running evaluation on a validation set:
***     Validation loss: 299.4581 
***     Validation WER:  1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 4, global step 9: ***     Train loss: 427.0314 
time per step = 0:00:0.513
***     Sample WER: 1.0000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: vmsiq fhlauat ncviiaqbad jh ihxcaavpvokjehc fkmetczw  mvh'q ebhtcd fa dhte  h ctpc hvpeanmin jjceqtavtjuhq hv  ea'tuiqd tuanhpatjob aecfsaseq'
*** Running evaluation on a validation set:
***     Validation loss: 286.9650 
***     Validation WER:  1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 6, global step 12: ***     Train loss: 280.5804 
time per step = 0:00:0.961
***     Sample WER: 1.0000
***     Sample target:     it set up a similar plant in wales in nineteen eighty five
***     Sample prediction:  idcd   qfa qliaa  hwhhpztahbtylsn
*** Running evaluation on a validation set:
***     Validation loss: 251.1409 
***     Validation WER:  1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 7, global step 15: ***     Train loss: 307.5512 
time per step = 0:00:0.527
***     Sample WER: 1.0000
***     Sample target:     quote there aren't any financial irregularities unquote he says
***     Sample prediction: hub nt pesm 
*** Running evaluation on a validation set:
***     Validation loss: 248.2195 
***     Validation WER:  1.0000
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 9, global step 18: ***     Train loss: 257.5955 
time per step = 0:00:0.469
***     Sample WER: 1.0000
***     Sample target:     volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
***     Sample prediction:   mej' tqx uc 
*** Running evaluation on a validation set:
***     Validation loss: 250.1655 
***     Validation WER:  1.0000
*** Epoch 10, global step 21: ***     Train loss: 148.8157 
time per step = 0:00:0.230
***     Sample WER: 1.0000
***     Sample target:     there was no autopsy period
***     Sample prediction: i  s t i
*** Running evaluation on a validation set:
***     Validation loss: 276.5779 
***     Validation WER:  1.0000
*** Epoch 12, global step 24: ***     Train loss: 328.6400 
time per step = 0:00:0.190
***     Sample WER: 1.0000
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: w   h hm   wnns vmd  ofh hs i s ts o  ni
*** Running evaluation on a validation set:
***     Validation loss: 282.3839 
***     Validation WER:  1.0000
*** Epoch 13, global step 27: ***     Train loss: 206.5612 
time per step = 0:00:0.179
***     Sample WER: 1.0000
***     Sample target:     there was no autopsy period
***     Sample prediction:  u es'
*** Running evaluation on a validation set:
***     Validation loss: 272.6924 
***     Validation WER:  1.0000
*** Epoch 15, global step 30: ***     Train loss: 230.8530 
time per step = 0:00:0.220
***     Sample WER: 1.0000
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction:  ee  eenie t estnhniaine
*** Running evaluation on a validation set:
***     Validation loss: 262.2114 
***     Validation WER:  1.0000
*** Epoch 16, global step 33: ***     Train loss: 241.3488 
time per step = 0:00:0.181
***     Sample WER: 1.0000
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: eea   rol
*** Running evaluation on a validation set:
***     Validation loss: 265.8921 
***     Validation WER:  1.0000
*** Epoch 18, global step 36: ***     Train loss: 312.4505 
time per step = 0:00:0.180
***     Sample WER: 1.0000
***     Sample target:     mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
***     Sample prediction: ei ts  eantrrtoh  et otnt al
*** Running evaluation on a validation set:
***     Validation loss: 272.6271 
***     Validation WER:  1.0000
*** Epoch 19, global step 39: ***     Train loss: 195.3567 
time per step = 0:00:0.256
***     Sample WER: 1.0000
***     Sample target:     volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
***     Sample prediction:  entinehpeuh  ia u  ea
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Finished training
*** Avg time per step: 0.339s
*** Avg objects per second: 7591.367

The training finished, but the loss still exploded.
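
To make the jump between step 0 and step 3 easier to see, the train-loss lines can be pulled out of a log like the one above with a short script. This is only a minimal sketch and not part of OpenSeq2Seq: the `parse_train_loss` helper and the regular expression are assumptions based on the log format shown here, and `SAMPLE_LOG` just repeats three lines copied from the output above.

```python
import re

# Matches lines such as:
#   *** Epoch 0, global step 0: ***     Train loss: 13.3622
# The pattern is an assumption derived from the log output in this thread.
LOSS_RE = re.compile(
    r"\*\*\* Epoch (\d+), global step (\d+): \*\*\*\s+Train loss: ([0-9.]+)"
)


def parse_train_loss(log_text):
    """Return a list of (epoch, global_step, train_loss) tuples found in log_text."""
    return [
        (int(epoch), int(step), float(loss))
        for epoch, step, loss in LOSS_RE.findall(log_text)
    ]


# Three lines copied from the transfer-learning log above (hypothetical input
# for illustration; in practice the whole log file would be read instead).
SAMPLE_LOG = """\
*** Epoch 0, global step 0: ***     Train loss: 13.3622
time per step = 0:00:6.706
*** Epoch 1, global step 3: ***     Train loss: 790.7427
*** Epoch 3, global step 6: ***     Train loss: 676.9883
"""

if __name__ == "__main__":
    for epoch, step, loss in parse_train_loss(SAMPLE_LOG):
        print(f"epoch={epoch:2d} step={step:3d} train_loss={loss:.4f}")
```

Running the same helper over the pre-trained model's log would give a second series, so the two runs can be compared step by step.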

RichardsonLiao commented 5 years ago

Hi @blisc, thanks for replying.

This is the training log of the pre-trained model:

[[23012,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: c08cb0a9b3b6

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from scratch
*** Training config:
{'batch_size_per_gpu': 2,
 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
                       'input_type': 'logfbank',
                       'max_duration': 16.7,
                       'num_audio_features': 64,
                       'shuffle': True,
                       'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
 'decoder_params': {'alpha': 2.0,
                    'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
                    'beam_width': 512,
                    'beta': 1.5,
                    'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
                    'initializer': <function xavier_initializer at 0x7f07ae4f2ae8>,
                    'lm_path': 'language_model/4-gram.binary',
                    'trie_path': 'language_model/trie.binary',
                    'use_language_model': False},
 'dtype': 'mixed',
 'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
 'encoder_params': {'activation_fn': <function <lambda> at 0x7f07ca1b57b8>,
                    'convnet_layers': [{'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [2],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [13],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [17],
                                        'num_channels': 96,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [21],
                                        'num_channels': 160,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [25],
                                        'num_channels': 128,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [2],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [29],
                                        'num_channels': 192,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [1],
                                        'num_channels': 256,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'}],
                    'data_format': 'channels_last',
                    'dropout_keep_prob': 0.7,
                    'initializer': <function xavier_initializer at 0x7f07ae4f2ae8>,
                    'initializer_params': {'uniform': False},
                    'normalization': 'batch_norm'},
 'eval_steps': 80,
 'iter_size': 1,
 'load_model': '',
 'logdir': 'w2ltestmpfixlr',
 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
 'loss_params': {},
 'lr_policy': <function fixed_lr at 0x7f07aa6c2d90>,
 'lr_policy_params': {'learning_rate': 0.0005},
 'num_checkpoints': 1,
 'num_epochs': 1000,
 'num_gpus': 2,
 'optimizer': 'Momentum',
 'optimizer_params': {'momentum': 0.9},
 'print_loss_steps': 80,
 'print_samples_steps': 80,
 'random_seed': 0,
 'regularizer': <function l2_regularizer at 0x7f07ae485e18>,
 'regularizer_params': {'scale': 0.001},
 'save_checkpoint_steps': 50,
 'save_summaries_steps': 10,
 'summaries': ['learning_rate',
               'variables',
               'gradients',
               'larc_summaries',
               'variable_norm',
               'gradient_norm',
               'global_gradient_norm'],
 'use_horovod': True}
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Trainable variables:
***   ForwardPass/w2l_encoder/conv11/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/kernel:0
***     shape: (13, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/kernel:0
***     shape: (17, 64, 96), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/gamma:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/beta:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/kernel:0
***     shape: (21, 96, 160), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/gamma:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/beta:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/kernel:0
***     shape: (25, 160, 128), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/gamma:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/beta:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/kernel:0
***     shape: (29, 128, 192), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/gamma:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/beta:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/kernel:0
***     shape: (1, 192, 256), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/gamma:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/beta:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
***     shape: (256, 29), <dtype: 'float16_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
***     shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
2019-02-13 03:07:32.846289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:07:32.846354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
[c08cb0a9b3b6:41951] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:41951] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 03:07:33.847436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:07:33.847474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      1 
2019-02-13 03:07:33.847500: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   N 
2019-02-13 03:07:33.848183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 03:07:33.893387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:07:33.893468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 03:07:34.501300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:07:34.501342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-02-13 03:07:34.501367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-02-13 03:07:34.502133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: ***     Train loss: 946.1538 
time per step = 0:00:0.120
***     Sample WER: 4.3333
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: tom h vmv tdsqmi bhxmqditmqydq pmztcf fmx fh  djdm m' yc z  vhv mscvtmpxuhfhqm u  tdhdvpdepdn k u'ephmpdym e vkhcf tziahxmdh dj mnhphusyv'tqma'jq qmtv itda a' vqtpa ' vei gkd th qu r dxv hptqjotmptdkqdnt jtvtipq odtc dhvh t  hqpsimtqahyd xstm m'ilx'klqpvhid 'qyt' tq htv q'jmqjc'tqde dliqdtq  tmjmbgvc jivtjeuheavmcvsqymdaphqhtdrqnkdh fxudk ncqpdqz snapcvctrbedctd
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 40, global step 80: ***     Train loss: 180.0961 
time per step = 0:00:0.122
***     Sample WER: 1.0000
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: 
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 80, global step 160: ***     Train loss: 113.8334 
time per step = 0:00:0.093
***     Sample WER: 1.0000
***     Sample target:     there was no autopsy period
***     Sample prediction: ni
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 120, global step 240: ***     Train loss: 162.2469 
time per step = 0:00:0.087
***     Sample WER: 1.0000
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: hh i itshg
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 160, global step 320: ***     Train loss: 161.0970 
time per step = 0:00:0.089
***     Sample WER: 1.0000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: tom is nenis sm  momyit
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 200, global step 400: ***     Train loss: 217.3916 
time per step = 0:00:0.093
***     Sample WER: 1.0000
***     Sample target:     volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
***     Sample prediction:    eeeedu  nls
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 240, global step 480: ***     Train loss: 111.8235 
time per step = 0:00:0.087
***     Sample WER: 1.0000
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: p    oomommny t issteeonohy
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 280, global step 560: ***     Train loss: 143.8996 
time per step = 0:00:0.094
***     Sample WER: 1.0000
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: hyahee pu  n oot  ofmoen  u ui teht nonugh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 320, global step 640: ***     Train loss: 99.6438 
time per step = 0:00:0.089
***     Sample WER: 0.9500
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: bossffuullofom na dwomei  nusssssuitt fresh hfrm w  hap owee  anunucooo ight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 360, global step 720: ***     Train loss: 106.7009 
time per step = 0:00:0.092
***     Sample WER: 0.9545
***     Sample target:     mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
***     Sample prediction: eennswil oovrersses e the  mapayny praatngninsas  wl  tehee ccppann ressrrc iitisaon saf  support s vrivies tee  ons sais
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 400, global step 800: ***     Train loss: 104.8907 
time per step = 0:00:0.093
***     Sample WER: 1.0000
***     Sample target:     mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
***     Sample prediction: i er jynss wl ovrrsee te ocommapany' opra tning unit aaswl ss the ompas resear aatititi es and saff suppporot sereivi   e  comay saiid
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 440, global step 880: ***     Train loss: 27.8002 
time per step = 0:00:0.085
***     Sample WER: 0.5833
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: taav put  i a a lot  of money y but is thath esnoogh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 480, global step 960: ***     Train loss: 52.1933 
time per step = 0:00:0.093
***     Sample WER: 0.5000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: boats flll of men and women ini uosinesuit fresh froo orr or happ hhourr were not a nuncoomon sight
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 520, global step 1040: ***     Train loss: 22.1820 
time per step = 0:00:0.089
***     Sample WER: 0.7778
***     Sample target:     quote there aren't any financial irregularities unquote he says
***     Sample prediction: quouott thhere arorent any fin ancial irregularities unnquoteh he sayss
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 560, global step 1120: ***     Train loss: 14.1787 
time per step = 0:00:0.101
***     Sample WER: 1.0000
***     Sample target:     there was no autopsy period
***     Sample prediction: thheree ws noautopsy eriod
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 600, global step 1200: ***     Train loss: 31.1219 
time per step = 0:00:0.105
***     Sample WER: 0.6471
***     Sample target:     volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
***     Sample prediction: uolume on thhe nw yyork estock exhane ttotaled one h hundred nand  eighty one point eight illion saharess
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 640, global step 1280: ***     Train loss: 28.4548 
time per step = 0:00:0.089
***     Sample WER: 0.8333
***     Sample target:     it set up a similar plant in wales in nineteen eighty five
***     Sample prediction: i seet up a similarr  plant i w aes nin nineeee eight  fi
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 680, global step 1360: ***     Train loss: 15.1728 
time per step = 0:00:0.090
***     Sample WER: 0.5000
***     Sample target:     it set up a similar plant in wales in nineteen eighty five
***     Sample prediction: it sset up a simililarer lant in wale inn nineteen eighty fiee
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 720, global step 1440: ***     Train loss: 15.1426 
time per step = 0:00:0.087
***     Sample WER: 0.2500
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: boatsts ful of men and women in business suits fresh from work or happpy hour were not an uncomon sigh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 760, global step 1520: ***     Train loss: 11.4827 
time per step = 0:00:0.091
***     Sample WER: 0.3333
***     Sample target:     it set up a similar plant in wales in nineteen eighty five
***     Sample prediction: it set up a simillar plant in wales in niuneteen eighghty fivv
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 800, global step 1600: ***     Train loss: 5.5641 
time per step = 0:00:0.090
***     Sample WER: 0.0833
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: thhey have put in a lot of money but is that enough
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 840, global step 1680: ***     Train loss: 5.8517 
time per step = 0:00:0.087
***     Sample WER: 0.4000
***     Sample target:     there was no autopsy period
***     Sample prediction: there was no autopsb periodd
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 880, global step 1760: ***     Train loss: 4.5704 
time per step = 0:00:0.090
***     Sample WER: 0.2000
***     Sample target:     there was no autopsy period
***     Sample prediction: there was no auutopsy period
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 920, global step 1840: ***     Train loss: 10.8155 
time per step = 0:00:0.086
***     Sample WER: 0.1667
***     Sample target:     it set up a similar plant in wales in nineteen eighty five
***     Sample prediction: it set up a  similar plant in wales in nineteen eeighty fiivvvv
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 960, global step 1920: ***     Train loss: 8.4008 
time per step = 0:00:0.090
***     Sample WER: 0.1176
***     Sample target:     volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
***     Sample prediction: volume on the new york stock excchange ttotaled one hundred and eighty one point eight million shares
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Finished training
*** Avg time per step: 0.092s
*** Avg objects per second: 28095.825
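
As a sanity check on the Sample WER values in the log above, here is a minimal sketch of word error rate computed as word-level edit distance divided by the number of target words. The function name `word_error_rate` is our own for illustration and this is not taken from the OpenSeq2Seq code, but it reproduces the logged values for the short samples (for example 0.2000 at step 1760 and 0.0833 at step 1600). It also shows why WER can exceed 1 when the prediction contains many more words than the target, as at step 0.

```python
def word_error_rate(target: str, prediction: str) -> float:
    """Word-level Levenshtein distance divided by the number of target words."""
    ref = target.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("target must contain at least one word")

    # prev[j] holds the edit distance between the ref words processed so far
    # and the first j words of the prediction.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # target word missing from prediction
                curr[j - 1] + 1,     # extra word in prediction
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five target words.
print(word_error_rate("there was no autopsy period",
                      "there was no auutopsy period"))  # 0.2
```

An empty prediction gives a WER of exactly 1.0, which matches the early steps in the log where the model outputs nothing.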

The model learns normally in this run. We then ran transfer learning with mixed precision:

[[51276,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: c08cb0a9b3b6

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from the base model
*** Training config:
{'batch_size_per_gpu': 2,
 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
                       'input_type': 'logfbank',
                       'max_duration': 16.7,
                       'num_audio_features': 64,
                       'shuffle': True,
                       'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
 'decoder_params': {'alpha': 2.0,
                    'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
                    'beam_width': 512,
                    'beta': 1.5,
                    'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
                    'initializer': <function xavier_initializer at 0x7f05fa2ceae8>,
                    'lm_path': 'language_model/4-gram.binary',
                    'trie_path': 'language_model/trie.binary',
                    'use_language_model': False},
 'dtype': 'mixed',
 'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
 'encoder_params': {'activation_fn': <function <lambda> at 0x7f060df597b8>,
                    'convnet_layers': [{'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [2],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [13],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [17],
                                        'num_channels': 96,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [21],
                                        'num_channels': 160,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [25],
                                        'num_channels': 128,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [2],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [29],
                                        'num_channels': 192,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [1],
                                        'num_channels': 256,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'}],
                    'data_format': 'channels_last',
                    'dropout_keep_prob': 0.7,
                    'initializer': <function xavier_initializer at 0x7f05fa2ceae8>,
                    'initializer_params': {'uniform': False},
                    'normalization': 'batch_norm'},
 'eval_steps': 80,
 'iter_size': 1,
 'load_model': 'w2ltestmpfixlr',
 'logdir': 'w2ltestmpfixlrTomp',
 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
 'loss_params': {},
 'lr_policy': <function fixed_lr at 0x7f05f649ed90>,
 'lr_policy_params': {'learning_rate': 0.0005},
 'num_checkpoints': 1,
 'num_epochs': 400,
 'num_gpus': 2,
 'optimizer': 'Momentum',
 'optimizer_params': {'momentum': 0.9},
 'print_loss_steps': 80,
 'print_samples_steps': 80,
 'random_seed': 0,
 'regularizer': <function l2_regularizer at 0x7f05fa261e18>,
 'regularizer_params': {'scale': 0.001},
 'save_checkpoint_steps': 50,
 'save_summaries_steps': 10,
 'summaries': ['learning_rate',
               'variables',
               'gradients',
               'larc_summaries',
               'variable_norm',
               'gradient_norm',
               'global_gradient_norm'],
 'use_horovod': True}
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 1
*** Warning: defaulting CTC loss to work in float32
*** Building graph in Horovod rank: 0
*** Trainable variables:
***   ForwardPass/w2l_encoder/conv11/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/kernel:0
***     shape: (11, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/kernel:0
***     shape: (13, 64, 64), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/kernel:0
***     shape: (17, 64, 96), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/gamma:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/beta:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/kernel:0
***     shape: (21, 96, 160), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/gamma:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/beta:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/kernel:0
***     shape: (25, 160, 128), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/gamma:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/beta:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/kernel:0
***     shape: (29, 128, 192), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/gamma:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/beta:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/kernel:0
***     shape: (1, 192, 256), <dtype: 'float16_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/gamma:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/beta:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
***     shape: (256, 29), <dtype: 'float16_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
***     shape: (29,), <dtype: 'float16_ref'>
*** Total trainable parameters: 1853725
Loading the base model from w2ltestmpfixlr.
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 03:17:26.330063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:17:26.330127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
2019-02-13 03:17:27.403729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:17:27.403768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      1 
2019-02-13 03:17:27.403793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   N 
2019-02-13 03:17:27.404870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
[c08cb0a9b3b6:78454] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:78454] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 03:17:27.874360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:17:27.874430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-13 03:17:28.439790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:17:28.439843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-02-13 03:17:28.439869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-02-13 03:17:28.440717: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"

checkpoint_dir w2ltestmpfixlr
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 2000
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
assign_ops [<tf.Tensor 'Assign_79:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_81:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(1, 192, 256) dtype=float16_ref>, <tf.Tensor 'Assign_83:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_85:0' shape=(256, 29) dtype=float16_ref>, <tf.Tensor 'Assign_86:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(21, 96, 160) dtype=float16_ref>, <tf.Tensor 'Assign_88:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_89:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_90:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_91:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_93:0' shape=(29,) dtype=float16_ref>, <tf.Tensor 'Assign_94:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(25, 160, 128) dtype=float16_ref>, <tf.Tensor 'Assign_96:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_97:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_98:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(17, 64, 96) dtype=float16_ref>, <tf.Tensor 'Assign_100:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_101:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_102:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_103:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_105:0' shape=(11, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_106:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_108:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_110:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_111:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_112:0' 
shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_113:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_114:0' shape=(13, 64, 64) dtype=float16_ref>, <tf.Tensor 'Assign_115:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_116:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_117:0' shape=(29, 128, 192) dtype=float16_ref>, <tf.Tensor 'Assign_118:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_119:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_120:0' shape=(128,) dtype=float32_ref>]
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 0, global step 0: ***     Train loss: 5.8474 
time per step = 0:00:0.243
***     Sample WER: 0.2500
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: in the process the suits alllee assets owere overstated and liabilities underted
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 40, global step 80: ***     Train loss: 205.0688 
time per step = 0:00:0.138
***     Sample WER: 1.0000
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: 
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 80, global step 160: ***     Train loss: 116.5186 
time per step = 0:00:0.101
***     Sample WER: 1.0000
***     Sample target:     there was no autopsy period
***     Sample prediction: 
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 120, global step 240: ***     Train loss: 185.9397 
time per step = 0:00:0.092
***     Sample WER: 1.0000
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: 
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 160, global step 320: ***     Train loss: 207.3828 
time per step = 0:00:0.104
***     Sample WER: 1.0000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: id
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 200, global step 400: ***     Train loss: 253.2719 
time per step = 0:00:0.100
***     Sample WER: 1.0000
***     Sample target:     volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
***     Sample prediction: teeehgh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 240, global step 480: ***     Train loss: 161.6801 
time per step = 0:00:0.091
***     Sample WER: 1.0000
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: ee hee  nyu thtenh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 280, global step 560: ***     Train loss: 244.1726 
time per step = 0:00:0.101
***     Sample WER: 1.0000
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: veh eeuiytththnengh
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 320, global step 640: ***     Train loss: 175.6226 
time per step = 0:00:0.091
***     Sample WER: 1.0000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: bflffnditif mo  nssighht
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Epoch 360, global step 720: ***     Train loss: 175.4331 
time per step = 0:00:0.106
***     Sample WER: 1.0000
***     Sample target:     mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
***     Sample prediction: iooyle eh omig wl eocrepacaienfpupossriit emyy si
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
WARNING:tensorflow:Issue encountered when serializing REGULARIZATION_FUNCTIONS.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
'tuple' object has no attribute 'name'
*** Finished training
*** Avg time per step: 0.101s
*** Avg objects per second: 25407.544

The loss still explodes. On the other hand, if the model uses tf.float32, training works normally:

[[36707,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: c08cb0a9b3b6

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
*** Using horovod
*** Starting training from the base model
*** Training config:
{'batch_size_per_gpu': 2,
 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'],
                       'input_type': 'logfbank',
                       'max_duration': 16.7,
                       'num_audio_features': 64,
                       'shuffle': True,
                       'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
 'decoder_params': {'alpha': 2.0,
                    'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
                    'beam_width': 512,
                    'beta': 1.5,
                    'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
                    'initializer': <function xavier_initializer at 0x7fad6a5cbae8>,
                    'lm_path': 'language_model/4-gram.binary',
                    'trie_path': 'language_model/trie.binary',
                    'use_language_model': False},
 'dtype': tf.float32,
 'encoder': <class 'open_seq2seq.encoders.tdnn_encoder.TDNNEncoder'>,
 'encoder_params': {'activation_fn': <function <lambda> at 0x7fad7e2567b8>,
                    'convnet_layers': [{'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [2],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [11],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [13],
                                        'num_channels': 64,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.8,
                                        'kernel_size': [17],
                                        'num_channels': 96,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [21],
                                        'num_channels': 160,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.7,
                                        'kernel_size': [25],
                                        'num_channels': 128,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [2],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [29],
                                        'num_channels': 192,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'},
                                       {'dilation': [1],
                                        'dropout_keep_prob': 0.6,
                                        'kernel_size': [1],
                                        'num_channels': 256,
                                        'padding': 'SAME',
                                        'repeat': 1,
                                        'stride': [1],
                                        'type': 'conv1d'}],
                    'data_format': 'channels_last',
                    'dropout_keep_prob': 0.7,
                    'initializer': <function xavier_initializer at 0x7fad6a5cbae8>,
                    'initializer_params': {'uniform': False},
                    'normalization': 'batch_norm'},
 'eval_steps': 80,
 'iter_size': 1,
 'load_model': 'w2ltestmpfixlr',
 'logdir': 'w2ltestmpfixlrTofloat',
 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
 'loss_params': {},
 'lr_policy': <function fixed_lr at 0x7fad66799d90>,
 'lr_policy_params': {'learning_rate': 0.0005},
 'num_checkpoints': 1,
 'num_epochs': 400,
 'num_gpus': 2,
 'optimizer': 'Momentum',
 'optimizer_params': {'momentum': 0.9},
 'print_loss_steps': 80,
 'print_samples_steps': 80,
 'random_seed': 0,
 'regularizer': <function l2_regularizer at 0x7fad6a54de18>,
 'regularizer_params': {'scale': 0.001},
 'save_checkpoint_steps': 50,
 'save_summaries_steps': 10,
 'summaries': ['learning_rate',
               'variables',
               'gradients',
               'larc_summaries',
               'variable_norm',
               'gradient_norm',
               'global_gradient_norm'],
 'use_horovod': True}
*** Building graph in Horovod rank: 0
*** Building graph in Horovod rank: 1
*** Trainable variables:
***   ForwardPass/w2l_encoder/conv11/kernel:0
***     shape: (11, 64, 64), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv11/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/kernel:0
***     shape: (11, 64, 64), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv21/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/kernel:0
***     shape: (13, 64, 64), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/gamma:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv31/bn/beta:0
***     shape: (64,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/kernel:0
***     shape: (17, 64, 96), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/gamma:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv41/bn/beta:0
***     shape: (96,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/kernel:0
***     shape: (21, 96, 160), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/gamma:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv51/bn/beta:0
***     shape: (160,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/kernel:0
***     shape: (25, 160, 128), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/gamma:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv61/bn/beta:0
***     shape: (128,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/kernel:0
***     shape: (29, 128, 192), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/gamma:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv71/bn/beta:0
***     shape: (192,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/kernel:0
***     shape: (1, 192, 256), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/gamma:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/w2l_encoder/conv81/bn/beta:0
***     shape: (256,), <dtype: 'float32_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0
***     shape: (256, 29), <dtype: 'float32_ref'>
***   ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0
***     shape: (29,), <dtype: 'float32_ref'>
*** Total trainable parameters: 1853725
Loading the base model from w2ltestmpfixlr.
SCAFFOLD TYPE: <class 'open_seq2seq.utils.helpers.TransferScaffold'>
2019-02-13 03:26:42.043638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:8a:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:26:42.043686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 1
2019-02-13 03:26:42.056528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-02-13 03:26:42.056565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
[c08cb0a9b3b6:30040] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[c08cb0a9b3b6:30040] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2019-02-13 03:26:42.753558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:26:42.753611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      1 
2019-02-13 03:26:42.753636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   N 
2019-02-13 03:26:42.754242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0)
2019-02-13 03:26:42.895000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-13 03:26:42.895046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-02-13 03:26:42.895072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-02-13 03:26:42.895851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30379 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0)
LOCAL INIT OP name: "group_deps"
op: "NoOp"
input: "^group_deps/NoOp"
input: "^group_deps/NoOp_1"

checkpoint_dir w2ltestmpfixlr
checkpoint_filename_with_path None
Restoring only the variables found in the checkpoint
Restoring from the step 2000
Restoring value to ForwardPass/w2l_encoder/conv31/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv61/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv31/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv61/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv51/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv81/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv21/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv11/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv41/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv11/bn/moving_variance
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/bias
Restoring value to ForwardPass/w2l_encoder/conv51/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv41/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv41/bn/gamma
Restoring value to ForwardPass/w2l_encoder/conv21/kernel
Restoring value to ForwardPass/w2l_encoder/conv71/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv21/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv31/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv11/kernel
Restoring value to ForwardPass/w2l_encoder/conv61/kernel
Restoring value to ForwardPass/w2l_encoder/conv11/bn/gamma
Restoring value to ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel
Restoring value to ForwardPass/w2l_encoder/conv81/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv71/bn/beta
Restoring value to ForwardPass/w2l_encoder/conv51/kernel
Restoring value to ForwardPass/w2l_encoder/conv21/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv51/bn/moving_mean
Restoring value to ForwardPass/w2l_encoder/conv81/bn/moving_variance
Restoring value to ForwardPass/w2l_encoder/conv41/kernel
Restoring value to ForwardPass/w2l_encoder/conv31/bn/moving_variance
assign_ops [<tf.Tensor 'Assign_69:0' shape=(13, 64, 64) dtype=float32_ref>, <tf.Tensor 'Assign_70:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_71:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_72:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_73:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_74:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_75:0' shape=(1, 192, 256) dtype=float32_ref>, <tf.Tensor 'Assign_76:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_77:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_78:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_79:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_80:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_81:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_82:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_83:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_84:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_85:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_86:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_87:0' shape=(29,) dtype=float32_ref>, <tf.Tensor 'Assign_88:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_89:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_90:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_91:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_92:0' shape=(96,) dtype=float32_ref>, <tf.Tensor 'Assign_93:0' shape=(11, 64, 64) dtype=float32_ref>, <tf.Tensor 'Assign_94:0' shape=(29, 128, 192) dtype=float32_ref>, <tf.Tensor 'Assign_95:0' shape=(128,) dtype=float32_ref>, <tf.Tensor 'Assign_96:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_97:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_98:0' shape=(11, 64, 64) dtype=float32_ref>, <tf.Tensor 'Assign_99:0' shape=(25, 160, 128) dtype=float32_ref>, <tf.Tensor 'Assign_100:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_101:0' shape=(256, 29) dtype=float32_ref>, <tf.Tensor 'Assign_102:0' shape=(256,) 
dtype=float32_ref>, <tf.Tensor 'Assign_103:0' shape=(192,) dtype=float32_ref>, <tf.Tensor 'Assign_104:0' shape=(21, 96, 160) dtype=float32_ref>, <tf.Tensor 'Assign_105:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_106:0' shape=(64,) dtype=float32_ref>, <tf.Tensor 'Assign_107:0' shape=(160,) dtype=float32_ref>, <tf.Tensor 'Assign_108:0' shape=(256,) dtype=float32_ref>, <tf.Tensor 'Assign_109:0' shape=(17, 64, 96) dtype=float32_ref>, <tf.Tensor 'Assign_110:0' shape=(64,) dtype=float32_ref>]
*** Epoch 0, global step 0: ***     Train loss: 4.5979 
time per step = 0:00:0.153
***     Sample WER: 0.1667
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: in the h process the suits allege assets were overstated and liabilities unnderstated
*** Epoch 40, global step 80: ***     Train loss: 4.3743 
time per step = 0:00:0.113
***     Sample WER: 0.1667
***     Sample target:     in the process the suits allege assets were overstated and liabilities understated
***     Sample prediction: in the prcess the suits alege assets were overstated and liabilities understated
*** Epoch 80, global step 160: ***     Train loss: 3.1033 
time per step = 0:00:0.087
***     Sample WER: 0.0000
***     Sample target:     there was no autopsy period
***     Sample prediction: there was no autopsy period
*** Epoch 120, global step 240: ***     Train loss: 1.8614 
time per step = 0:00:0.086
***     Sample WER: 0.0000
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: they have put in a lot of money but is that enough
*** Epoch 160, global step 320: ***     Train loss: 1.4958 
time per step = 0:00:0.090
***     Sample WER: 0.0000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Epoch 200, global step 400: ***     Train loss: 8.3446 
time per step = 0:00:0.087
***     Sample WER: 0.1765
***     Sample target:     volume on the new york stock exchange totaled one hundred and eighty one point eight million shares
***     Sample prediction: volume on the new ork stock exchange totaled one hundred anod eighty one point eight million sharehs
*** Epoch 240, global step 480: ***     Train loss: 3.7830 
time per step = 0:00:0.091
***     Sample WER: 0.0000
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: they have put in a lot of money but is that enough
*** Epoch 280, global step 560: ***     Train loss: 6.5796 
time per step = 0:00:0.086
***     Sample WER: 0.0833
***     Sample target:     they have put in a lot of money but is that enough
***     Sample prediction: they have putt in a lot of money but is that enough
*** Epoch 320, global step 640: ***     Train loss: 2.0019 
time per step = 0:00:0.085
***     Sample WER: 0.0000
***     Sample target:     boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
***     Sample prediction: boats full of men and women in business suits fresh from work or happy hour were not an uncommon sight
*** Epoch 360, global step 720: ***     Train loss: 1.6616 
time per step = 0:00:0.088
***     Sample WER: 0.0000
***     Sample target:     mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
***     Sample prediction: mister joynes will oversee the company's operating units as well as the company's research activities and staff support services the company said
*** Finished training
*** Avg time per step: 0.089s
*** Avg objects per second: 28930.083
blisc commented 5 years ago

I would like to confirm that you have pulled #333 to your testing branch?

RichardsonLiao commented 5 years ago

"I would like to confirm that you have pulled #333 to your testing branch?"

Yes, I've tried this version, but the situation is still the same. Thanks for your help!

blisc commented 5 years ago

I am surprised that mixed -> mixed does not work.

Here are a few more tweaks that you can try to debug where this issue is coming from:

  1. Can you try using --continue_learning without load_model to see if loss explosion still occurs?
  2. Can you set restore_all to True on this line: https://github.com/NVIDIA/OpenSeq2Seq/blob/b257b18081da68d7b80bfe2df32f2cfdcb668490/open_seq2seq/utils/helpers.py#L118
  3. Can you try using SGD without momentum?
  4. Can you try using a warm up learning rate?
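For readers unfamiliar with the last suggestion, a warm-up learning rate ramps the LR up from a very small value over the first steps instead of starting at the full value. The following is a minimal, library-independent sketch of a linear warm-up. The function name `warmup_lr` and all default values are illustrative assumptions and are not taken from OpenSeq2Seq.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=100, start_lr=1e-7):
    """Linear warm-up: ramp from start_lr to base_lr over warmup_steps steps.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        # Warm-up is over (or disabled): use the full learning rate.
        return base_lr
    fraction = step / warmup_steps
    return start_lr + fraction * (base_lr - start_lr)


if __name__ == "__main__":
    for step in (0, 25, 50, 100, 200):
        print(step, warmup_lr(step))
```

If the loss still explodes with such a schedule, the instability is unlikely to be caused by too large an initial update.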

RichardsonLiao commented 5 years ago

Hi @blisc, after trying all the tweaks you mentioned, here are the results:

Pre-trained model: mixed precision, SGD optimizer, "LARC" turned off.

  1. Can you try using --continue_learning without load_model to see if loss explosion still occurs?

Training with "--continue_learning" works fine, whether the model is mixed precision (MP) or tf.float32. In fact, we had already confirmed that it works well before we tried transfer learning.

  2. Can you set restore_all to True on this line: restore_all = False

mixed -> mixed: loss explosion.

mixed -> tf.float32: the program shows "No enough steps for benchmarking," and then stops.

  4. Can you try using a warm up learning rate?

We tried warm-up with LR = 1e-7 (the default was 1e-3), and the loss explosion happened again.

Thank you!

borisgin commented 5 years ago

First of all, if you use Horovod, please set "num_gpus": 1 in the config file.

Next:

"The program shows "No enough steps for benchmarking," and it stops." Do you have "repeat": True in the eval_params?

RichardsonLiao commented 5 years ago

Hi @borisgin, got it, we'll try this configuration. BTW, we have also tested this on a single-GPU machine without Horovod, and the situation is the same. Also, after setting "repeat": "True", it still shows "No enough steps for benchmarking." Thanks a lot!
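The Horovod setting suggested above can be sketched as a config fragment. Only the two keys shown come from this thread. The enclosing `base_params` dictionary name is an assumption about how the config file is laid out, and every other key is omitted. With Horovod, the usual pattern is one process per GPU, with the number of processes chosen by the MPI launcher rather than by the config.

```python
# Fragment of a Python config file (assumed layout; other keys omitted).
base_params = {
    "use_horovod": True,
    # One GPU per Horovod process; the total number of workers
    # is determined by how many processes the launcher starts.
    "num_gpus": 1,
}
```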

billy800413 commented 5 years ago

Hi, I have also run into this problem, and I think it may be in MixedPrecisionOptimizerWrapper. When I disable the MixedPrecisionOptimizerWrapper (open_seq2seq/optimizers/optimizers.py, line 205), everything works fine: the loss does not explode when I use transfer learning. So I think there is a bug here.
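The thread does not identify the root cause. One hypothesis that would be consistent with the reported symptoms (normal loss at step 0, explosion afterwards even with LR = 0, and only with mixed precision) is that the optimizer keeps full-precision master copies of the weights which are not restored from the checkpoint, so the first optimizer step overwrites the restored FP16 weights. The toy model below illustrates that mechanism only. It is not OpenSeq2Seq code, the class `ToyMixedPrecisionSGD` and the helper `run` are hypothetical, and the checkpoint values are made up.

```python
import random
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


class ToyMixedPrecisionSGD:
    """Hypothetical stand-in for a mixed-precision optimizer wrapper."""

    def __init__(self, model_weights):
        # Full-precision master copies, created once from the weights as
        # they are when the optimizer is built (here: the random init).
        self.master = list(model_weights)

    def step(self, model_weights, grads, lr):
        # Update the master copies, then overwrite the FP16 model weights.
        for i, grad in enumerate(grads):
            self.master[i] -= lr * grad
            model_weights[i] = to_fp16(self.master[i])


def run(restore_master):
    rng = random.Random(0)
    weights = [to_fp16(rng.gauss(0.0, 1.0)) for _ in range(4)]  # random init
    optimizer = ToyMixedPrecisionSGD(weights)

    checkpoint = [0.5, -0.25, 0.125, 1.0]  # made-up "pre-trained" values
    weights[:] = checkpoint  # restore the model variables only
    if restore_master:
        optimizer.master[:] = checkpoint  # also restore the master copies

    # Even with zero gradients and LR = 0, one step rewrites the weights
    # from whatever the master copies currently hold.
    optimizer.step(weights, grads=[0.0] * 4, lr=0.0)
    return weights, checkpoint


if __name__ == "__main__":
    weights, checkpoint = run(restore_master=False)
    print("weights kept without restoring master copies:", weights == checkpoint)
    weights, checkpoint = run(restore_master=True)
    print("weights kept when master copies are restored:", weights == checkpoint)
```

In this toy setting the restored weights survive the first step only when the master copies are restored too. Whether the same thing happens in MixedPrecisionOptimizerWrapper would have to be checked against the actual code.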