NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0
1.54k stars, 369 forks

convolution translator consistently hangs when run in mixed mode #385

Open David-Levinthal opened 5 years ago

David-Levinthal commented 5 years ago

The convolution code seems to consistently hang in the same spot, in some sort of infinite loop (100% GPU usage), when run with Horovod on 4 GPUs. The system is using CUDA 10.1, cuDNN 7.5, and NCCL 2.4.2 on Ubuntu 16.04.5, with these packages:

cuda-repo-ubuntu1604-10-1-local-10.1.105-418.39_1.0-1_amd64.deb
libcudnn7_7.5.0.56-1+cuda10.1_amd64.deb
libcudnn7-dev_7.5.0.56-1+cuda10.1_amd64.deb
libcudnn7-doc_7.5.0.56-1+cuda10.1_amd64.deb
nccl-repo-ubuntu1604-2.4.2-ga-cuda10.1_1-1_amd64.deb

From the logs:

levinth@csig6ztmoxl003:~/OpenSeq2Seq$ tail conv*
==> conv_cuda101_75_mpiexec_2.log <==
Global step 157500: Train loss: 0.8891 time per step = 0:00:0.316
Global step 157600: Train loss: 1.0271 time per step = 0:00:0.313
Global step 157700: Train loss: 0.9281 time per step = 0:00:0.310
Global step 157800: Train loss: 0.9351 time per step = 0:00:0.315
Global step 157900: Train loss: 1.2120 time per step = 0:00:0.320

==> conv_cuda101_75_mpiexec.log <==
Global step 157500: Train loss: 0.8886 time per step = 0:00:0.303
Global step 157600: Train loss: 0.9680 time per step = 0:00:0.307
Global step 157700: Train loss: 0.9262 time per step = 0:00:0.310
Global step 157800: Train loss: 1.0593 time per step = 0:00:0.304
Global step 157900: Train loss: 1.2313 time per step = 0:00:0.300
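Both runs stop reporting after global step 157900. A quick way to check this across several log files might look like the sketch below. It assumes only the "Global step N: Train loss: X" line format visible in the tails above; the helper name `last_logged_step` is made up for illustration and is not part of OpenSeq2Seq.

```python
import re

# Matches progress entries such as:
#   Global step 157900: Train loss: 1.2120 time per step = 0:00:0.320
STEP_RE = re.compile(r"Global step (\d+): Train loss: ([0-9.]+)")


def last_logged_step(log_text):
    """Return (step, loss) for the last progress entry in log_text, or None."""
    matches = STEP_RE.findall(log_text)
    if not matches:
        return None
    step, loss = matches[-1]
    return int(step), float(loss)


# Final two entries of each log, copied from the tails quoted in this issue.
run_1 = (
    "Global step 157800: Train loss: 1.0593 time per step = 0:00:0.304\n"
    "Global step 157900: Train loss: 1.2313 time per step = 0:00:0.300\n"
)
run_2 = (
    "Global step 157800: Train loss: 0.9351 time per step = 0:00:0.315\n"
    "Global step 157900: Train loss: 1.2120 time per step = 0:00:0.320\n"
)

# True when both runs stopped logging at the same global step.
same_stop = last_logged_step(run_1)[0] == last_logged_step(run_2)[0]
print(same_stop)
```

In practice the strings would be replaced by the contents of the attached log files.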

David-Levinthal commented 5 years ago

I left this out: this was run in mixed mode with loss scaling (i.e., uncommenting the 2 lines and commenting out the FP32 declaration for dtype).
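For readers who have not seen the example configs, the switch described here probably looks something like the fragment below. This is a sketch under assumptions: the thread does not show the actual config file, so the key names `dtype` and `loss_scaling` and the values `"mixed"` and `"Backoff"` are my best guess at the two uncommented lines, and every other parameter is omitted.

```python
# Hypothetical precision-related fragment of an OpenSeq2Seq-style config.
# Key names and values are assumptions; the real convs2s config is not
# shown in this thread.
base_params = {
    # "dtype": tf.float32,      # FP32 declaration, commented out for this run
    "dtype": "mixed",           # assumed: enables mixed-precision training
    "loss_scaling": "Backoff",  # assumed: automatic loss-scaling policy
}

print(base_params["dtype"], base_params["loss_scaling"])
```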

borisgin commented 5 years ago

David, can you attach a complete log please?

David-Levinthal commented 5 years ago

Do you want me to attach them to the bug as well?


borisgin commented 5 years ago

yes, please.

David-Levinthal commented 5 years ago

conv_cuda101_75_mpiexec.log conv_cuda101_75_mpiexec_2.log

David-Levinthal commented 5 years ago

The same thing happens (i.e., it hangs in exactly the same place) when run in FP32 mode on 4 V100s.

conv_cuda101_75_mpiexec_fp322.log

borisgin commented 5 years ago

Do you use "train"-only mode or "train_eval"? Does the code hang when there is only 1 GPU, without Horovod? Can you re-run with checkpoint_steps=100000, please?
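The requested re-run amounts to three config changes: one GPU, Horovod off, and checkpoints every 100000 steps. A minimal sketch of applying them is below. The key names `use_horovod`, `num_gpus`, and `save_checkpoint_steps` are assumptions about the config (the comment above only says `checkpoint_steps`), the helper `single_gpu_overrides` is hypothetical, and the starting values are placeholders rather than the reporter's real settings.

```python
import copy


def single_gpu_overrides(base_params, checkpoint_steps=100000):
    """Return a copy of base_params set up for one GPU without Horovod."""
    params = copy.deepcopy(base_params)
    params["use_horovod"] = False  # assumed key name
    params["num_gpus"] = 1  # assumed key name
    params["save_checkpoint_steps"] = checkpoint_steps  # assumed key name
    return params


# Placeholder values standing in for the 4-GPU Horovod run.
original = {"use_horovod": True, "num_gpus": 4, "save_checkpoint_steps": 1000}
rerun = single_gpu_overrides(original)
print(rerun)
```

Copying the dictionary keeps the multi-GPU settings intact, so the two runs can be compared side by side.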

David-Levinthal commented 5 years ago

I am running with train only; I find train_eval produces too much output. I will run on 1 GPU without Horovod and with checkpointing at 100K.


David-Levinthal commented 5 years ago

Well, that was not a rousing success:

TypeError: Failed to convert object of type <class 'dict'> to Tensor. Contents: {'source_tensors': [<tf.Tensor 'IteratorGetNext:0' shape=(?, ?) dtype=int32>, <tf.Tensor 'IteratorGetNext:1' shape=(?,) dtype=int32>], 'target_tensors': [<tf.Tensor 'IteratorGetNext:2' shape=(?, ?) dtype=int32>, <tf.Tensor 'IteratorGetNext:3' shape=(?,) dtype=int32>]}. Consider casting elements to a supported type.


borisgin commented 5 years ago

Looks like we have a bug in convs2s :(


David-Levinthal commented 5 years ago

It fails in the same way with both mixed and FP32 modes on 1 GPU with Horovod disabled (100K checkpointing).
