echan00 closed this issue 5 years ago
What's your Kashgari version?
Yes, confirming this is fixed on v0.5.3
@BrikerMan, I was wondering whether it is normal for training to take much longer in the new version. After I upgraded to v0.5.3, training takes ~6 hours per epoch on both single and multiple GPUs (compared to 30-60 mins previously).
Val accuracy/loss numbers look much more appropriate now though. I'm hopeful it resolves the issues I was having with https://github.com/BrikerMan/Kashgari/issues/196
@echan00 In the new version, the CuDNN cell is disabled by default. You can check out the details in the release notes: https://github.com/BrikerMan/Kashgari/releases To speed up training, enable the CuDNN cell; here is the tutorial: https://kashgari.bmio.net/tutorial/text-classification/#speed-up-with-cudnn-cell
Just tried enabling CuDNN cell but received an error similar to the one posted on this bug:
Thank you so much for your time! Your support is incredible.
2019-09-10 23:44:08.238355: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-10 23:44:15.812470: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-10 23:44:18.023753: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:18.024208: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:18.918001: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:18.918423: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:19.827998: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:19.828010: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:21.643513: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
2019-09-10 23:44:21.643513: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at cudnn_rnn_ops.cc:1336 : Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
Traceback (most recent call last):
File "process2-4.py", line 276, in <module>
model.fit(train_final_x, train_final_y, valid_final_x, valid_final_y, epochs=5)
File "/usr/local/lib/python3.6/dist-packages/kashgari/tasks/base_model.py", line 293, in fit
**fit_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1433, in fit_generator
steps_name='steps_per_epoch')
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_generator.py", line 264, in model_iteration
batch_outs = batch_function(*batch_data)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1175, in train_on_batch
outputs = self.train_function(ins) # pylint: disable=not-callable
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
run_metadata=self.run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1458, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
[[{{node replica_0/model_6/layer_blstm/CudnnRNN_1}}]]
[[replica_1/model_6/activation/truediv/_3967]]
(1) Unknown: CUDNN_STATUS_BAD_PARAM
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1285): 'cudnnSetTensorNdDescriptor( tensor_desc.get(), data_type, sizeof(dims) / sizeof(dims[0]), dims, strides)'
[[{{node replica_0/model_6/layer_blstm/CudnnRNN_1}}]]
0 successful operations.
4 derived errors ignored.
This issue seems to be related to the CuDNN cell on multi-GPU. Maybe you could search for issues related to CUDNN_STATUS_BAD_PARAM.
It seems like the problem may be related to requiring the number of data samples to be a multiple of the batch size (see https://github.com/keras-team/keras/issues/11434). But in my case, with the CuDNN cell enabled, even one GPU is fast enough for the time being.
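For anyone who wants to test the batch-size theory, here is a minimal sketch that drops trailing samples so the dataset length is an exact multiple of the batch size. The helper name `trim_to_batch_multiple` and the toy data are hypothetical, not part of Kashgari or Keras, and whether this actually avoids CUDNN_STATUS_BAD_PARAM on multi-GPU is an assumption based on the linked Keras issue, not something confirmed in this thread.

```python
def trim_to_batch_multiple(x, y, batch_size):
    """Drop trailing samples so len(x) is an exact multiple of batch_size.

    Hypothetical helper: it only guarantees that no partial final batch
    is ever produced from the returned data.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    usable = (len(x) // batch_size) * batch_size
    if usable == 0:
        raise ValueError("fewer samples than one full batch")
    return x[:usable], y[:usable]


# Toy data standing in for tokenized sentences and their labels.
train_x = [["token"] for _ in range(1003)]
train_y = ["label"] * 1003

trimmed_x, trimmed_y = trim_to_batch_multiple(train_x, train_y, batch_size=64)
print(len(trimmed_x), len(trimmed_y))  # 960 960
```

The trimmed lists can then be passed to model.fit with the same batch size used for trimming. A few samples are discarded each run, so shuffle before trimming if you do not want to lose the same ones every time.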
I'm having trouble training my model on multiple GPUs and can't determine the cause of the error. It is below: