BrikerMan / Kashgari

Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.
http://kashgari.readthedocs.io/
Apache License 2.0
2.39k stars 441 forks source link

[BUG] Multi-GPU model training error Unknown: CUDNN_STATUS_BAD_PARAM #236

Closed echan00 closed 5 years ago

echan00 commented 5 years ago

Having trouble training my model with multi-GPUs. Having trouble determining the cause of the error. It is below:

>>> model = BiLSTM_Model(embedding=stack_embedding)
>>> model.build_multi_gpu_model(gpus=7, x_train=train_final_x, y_train=train_final_y, x_validate=valid_final_x, y_validate=valid_final_y)
>>> model.fit(train_final_x, train_final_y, valid_final_x, valid_final_y, epochs=5) 

2019-09-10 09:44:36.591911: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-10 09:44:46.314798: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-10 09:44:48.460733: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 09:44:48.461639: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 09:44:49.371682: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 09:44:49.371771: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 09:44:50.280970: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 09:44:50.281055: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 09:44:51.190873: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 09:44:51.190949: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.6/dist-packages/kashgari/tasks/base_model.py", line 295, in fit
    **fit_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1433, in fit_generator
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_generator.py", line 264, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1175, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node replica_0/model_6/layer_blstm/CudnnRNN_1}}]]
         [[replica_1/model_6/activation/truediv/_4529]]
  (1) Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node replica_0/model_6/layer_blstm/CudnnRNN_1}}]]
0 successful operations.
5 derived errors ignored.
BrikerMan commented 5 years ago

What's your kashgari's version?

echan00 commented 5 years ago

Yes, confirming this is fixed on v0.5.3

echan00 commented 5 years ago

@BrikerMan, I was wondering whether it Is normal for training to take much longer in the new version? After I upgraded to v0.5.3, (both one or multi-gpu) training takes ~6 hours per epoch (compared to 30-60 mins previously)

Val accuracy/loss numbers look much more appropriate now though. I'm hopeful it resolves the issues I was having with https://github.com/BrikerMan/Kashgari/issues/196

BrikerMan commented 5 years ago

@echan00 In the new version, CuDNN cell is disabled by default. You could checkout details here. https://github.com/BrikerMan/Kashgari/releases You could speed up training by using the CuDNN cell, here is the tutorial https://kashgari.bmio.net/tutorial/text-classification/#speed-up-with-cudnn-cell

echan00 commented 5 years ago

Just tried enabling CuDNN cell but received an error similar to the one posted on this bug:

Thank you so much for your time! Your support is incredible.

2019-09-10 23:44:08.238355: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-10 23:44:15.812470: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-10 23:44:18.023753: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:18.024208: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:18.918001: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:18.918423: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:19.827998: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:19.828010: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:21.643513: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:21.643513: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
Traceback (most recent call last):
  File "process2-4.py", line 276, in <module>
    model.fit(train_final_x, train_final_y, valid_final_x, valid_final_y, epochs=5)
  File "/usr/local/lib/python3.6/dist-packages/kashgari/tasks/base_model.py", line 293, in fit
    **fit_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1433, in fit_generator
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_generator.py", line 264, in model_iteration
    batch_outs = batch_function(*batch_data)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1175, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
    run_metadata=self.run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node replica_0/model_6/layer_blstm/CudnnRNN_1}}]]
         [[replica_1/model_6/activation/truediv/_3967]]
  (1) Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
         [[{{node replica_0/model_6/layer_blstm/CudnnRNN_1}}]]
0 successful operations.
4 derived errors ignored.
BrikerMan commented 5 years ago

This issue seems to related to CuDNN cell on multi-GPU. Maybe you could search issues related to CUDNN_STATUS_BAD_PARAM.

echan00 commented 5 years ago

It seems like the problem may be related to requiring data sample sizes to be a multiple of the batch size (See https://github.com/keras-team/keras/issues/11434). But in my case with CuDNN cell enabled, even one GPU is fast enough for time being.