sbodenstein opened this issue 6 years ago
Any comments on this?
Awesome initiative @sbodenstein! Bumping again to @piiswrong.
@sandeep-krishnamurthy please label:
I don't think anyone is working on this. This can be added as an option of `sym.RNN`: when `use_mask=True`, `RNN` can take an extra argument.
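A minimal sketch of what that might look like (both `use_mask` and the extra `mask` input are hypothetical names from this suggestion, not an existing API):

```python
import mxnet as mx

data = mx.sym.Variable('data')        # (seq_len, batch, input_dim), padded
mask = mx.sym.Variable('mask')        # (seq_len, batch): 1 for real steps, 0 for padding
params = mx.sym.Variable('rnn_params')
init_h = mx.sym.Variable('init_h')
init_c = mx.sym.Variable('init_c')

# Hypothetical: with use_mask=True, sym.RNN would accept the mask as an extra
# argument and skip the padded time steps when updating the states.
out = mx.sym.RNN(data=data, parameters=params, state=init_h, state_cell=init_c,
                 mask=mask, use_mask=True,
                 state_size=256, num_layers=1, mode='lstm')
```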
ping @DickJC123 @ptrendx again
I've been remiss in commenting because frankly new cuDNN features are generally advertised only upon general availability. The cuDNN team is continually improving the feature set, and it's clear to us that sorting and packing of sequences is a customer pain point. As such, we are working on feature improvements in this area for one of our future releases. I'll point out that cuDNN v7.1 is newly available for download, although this minor release has no new features to address variable-length sequences. If there's more detail you are able to share about your application, whether publicly or privately, that would be very helpful.
So it seems we should wait a bit before jumping in to add support. Intel will apparently have their RNN operator ready this month as well, so that might be a good time to revisit this.
Any progress on this?
The newly released cuDNN v7.2 (https://developer.nvidia.com/cudnn) now supports more RNN cell use cases with options for cell clipping and padding masks. Thanks for your patience on this, and feel free to comment on whether the new API meets your needs.
Awesome new set of features! They can be found in https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#features-of-rnn-functions. Since the changes will likely require API changes in RNN op, I took the liberty to repurpose @sbodenstein 's issue for the API discussion.
Desirable new features would be:
- padding masks for variable-length sequences
- cell clipping
- RNN projection (LSTMP)

Desirable performance improvements would be:
- algorithm search via `findIntensity`

Unclear:
- `_ALGO_PERSIST_*`?

@DickJC123 @ptrendx Also pinging @sxjscience @eric-haibin-lin @sbodenstein @yunzhe-tao @piiswrong @Jerryzcn
Pinging @pengzhao-intel, @Sherry-Zhang @lihaofd @TaoLv
Just saw the message :( The variable-length RNN on the CPU side is scheduled to be implemented in our Q4. @ciyongch
@pengzhao-intel: thanks! That's useful for us to know.
@DickJC123 @ptrendx I think the new features are very useful, since we are working on large-scale NLP models, most of which are based on RNN cells. For example, we need cell state clipping and projection clipping to keep the models from overfitting. Similarly, it would be very helpful to have projection in the RNN cell; that way we can reduce the memory usage and significantly increase the model capacity. So, looking forward to the new features in cuDNN v7.2 making it into MXNet.
@pengzhao-intel Also looking forward to CPU support! Thanks.
@haojin2 is driving the addition of the options for LSTM projection and rnn state clipping in #12541.
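For reference, once #12541 lands, usage might look roughly like the sketch below (parameter names such as `projection_size` and `state_clip_min`/`state_clip_max` follow that proposal and could change; projection and state clipping need the cuDNN path, i.e. a GPU context):

```python
import mxnet as mx
from mxnet import gluon

# Sketch: an LSTM with recurrent projection (LSTMP) and cell-state clipping.
lstm = gluon.rnn.LSTM(hidden_size=1024, num_layers=2,
                      projection_size=256,     # project h_t from 1024 down to 256 dims
                      state_clip_min=-50.0,    # clip the cell state into [-50, 50]
                      state_clip_max=50.0)
lstm.initialize(ctx=mx.gpu(0))

x = mx.nd.random.uniform(shape=(35, 8, 512), ctx=mx.gpu(0))  # (seq_len, batch, input_dim)
out = lstm(x)   # (35, 8, 256): the projection size, not the hidden size
```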
@szha, @DickJC123: TensorCore support for RNNs is currently disabled. We would really like to make use of this for training large RNN models.
Are there any plans to add support for it soon?
Also: what about a flag `cudnn_use_tensorcores` (similar to the cudnn flags in convolution)? When `true`, it will use TensorCores even with `float32` (and handle internal downcasting, or use the new `cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION)`). If it's `None`, it will use the standard heuristics (environment variable + check for `float16`).
I'm not sure about adding the flag as part of the operator arguments, because this option is not meaningful to other implementations. I think instead we can have it as an environment variable, similar to other performance-related features.
We definitely want to add this support, though we are currently a bit short-handed. A patch is most certainly welcome.
@szha: I'm happy to make a PR. One question about the behaviour: the flag `MXNET_CUDA_ALLOW_TENSOR_CORE` is set to `true` by default. Its behaviour is:

- If `DType` is `float16` and `MXNET_CUDA_ALLOW_TENSOR_CORE` is `true`, TensorCores are used. Presumably, the use of TensorCores will never worsen `float16` training, so defaulting to TensorCore use seems reasonable.
- If `DType` is `float32`, then we can't assume the user wants to use TensorCores. But they do want a way of opting in (and setting `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION` when available). So how about an environment variable `MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION`, which defaults to `false`, and if `true`, will let `float32` or `float64` nets use TensorCores via implicit downcasting?
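A rough sketch of the behaviour I have in mind (the second variable is only the proposal above, and the real check would of course live in the C++ backend, not Python):

```python
import os

def rnn_uses_tensor_cores(dtype):
    """Sketch of the proposed opt-in logic; not the actual MXNet implementation."""
    allow_tc = os.environ.get("MXNET_CUDA_ALLOW_TENSOR_CORE", "1") == "1"
    allow_conversion = os.environ.get(
        "MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION", "0") == "1"  # proposed, default false
    if dtype == "float16":
        # TensorCores should never hurt fp16 training, so follow the existing flag.
        return allow_tc
    if dtype in ("float32", "float64"):
        # Opt-in only: implicit downcasting must be explicitly requested.
        return allow_tc and allow_conversion
    return False
```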
Yes, I agree that the new variable is necessary. This matches the "_ALLOW_CONVERSION" flags in the cudnn interface.
We want to add support for variable-length sequences to the cuDNN RNN operator, as we cannot 'fake' support via masking for LSTM (the cuDNN operator does not return a history of cell states) and bidirectional RNNs. Also, there should be some speedups for batches of sequences of very different lengths (can avoid bucketing).
The annoying part is that cuDNN requires that the sequences be sorted by sequence length and packed into a contiguous block of memory. PyTorch deals with this by introducing a `pack_padded_sequence` operator, which returns a `PackedSequence` object that the user can feed to the cuDNN RNN. MXNet probably (?) doesn't want to take this route, and should instead hide these details from the user (other backends, e.g. the upcoming MKL implementation of `symbol.rnn`, might not require this anyway) and just pack/unpack inside `symbol.RNN`. So some questions: @sxjscience, @piiswrong, @szha
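For reference, the PyTorch route described above looks roughly like this (minimal sketch; in the PyTorch versions current at the time, the lengths must be pre-sorted in decreasing order):

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

batch, max_len, input_dim, hidden = 4, 7, 16, 32
lengths = [7, 5, 3, 2]                       # sorted in decreasing order
x = torch.randn(max_len, batch, input_dim)   # padded batch, (seq_len, batch, input_dim)

packed = pack_padded_sequence(x, lengths)    # packs into a contiguous PackedSequence
lstm = nn.LSTM(input_dim, hidden)
packed_out, (h_n, c_n) = lstm(packed)        # cuDNN consumes the packed layout directly
out, out_lengths = pad_packed_sequence(packed_out)  # back to a padded (max_len, batch, hidden) tensor
```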