sbodenstein opened this issue 6 years ago
Any comments on this?
Awesome initiative @sbodenstein! Bumping again to @piiswrong.
@sandeep-krishnamurthy please label:
I don't think anyone is working on this. This can be added as an option of `sym.RNN`: when `use_mask=True`, `RNN` can take an extra argument.
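A minimal sketch of what that might look like (both `use_mask` and the extra `mask` input are hypothetical names from this suggestion, not an existing API):

```python
import mxnet as mx

data = mx.sym.Variable('data')        # (seq_len, batch, input_dim), padded
mask = mx.sym.Variable('mask')        # (seq_len, batch): 1 for real steps, 0 for padding
params = mx.sym.Variable('rnn_params')
init_h = mx.sym.Variable('init_h')
init_c = mx.sym.Variable('init_c')

# Hypothetical: with use_mask=True, sym.RNN would accept the mask as an extra
# argument and skip the padded time steps when updating the states.
out = mx.sym.RNN(data=data, parameters=params, state=init_h, state_cell=init_c,
                 mask=mask, use_mask=True,
                 state_size=256, num_layers=1, mode='lstm')
```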
ping @DickJC123 @ptrendx again
I've been remiss in commenting because frankly new cuDNN features are generally advertised only upon general availability. The cuDNN team is continually improving the feature set, and it's clear to us that sorting and packing of sequences is a customer pain point. As such, we are working on feature improvements in this area for one of our future releases. I'll point out that cuDNN v7.1 is newly available for download, although this minor release has no new features to address variable-length sequences. If there's more detail you are able to share about your application, whether publicly or privately, that would be very helpful.
So it seems we should wait a bit before jumping in to add support. Intel will apparently have their RNN operator ready this month as well, so that might be a good time to revisit this.
Any progress on this?
The newly released cuDNN v7.2 (https://developer.nvidia.com/cudnn) now supports more RNN cell use cases with options for cell clipping and padding masks. Thanks for your patience on this, and feel free to comment on whether the new API meets your needs.
Awesome new set of features! They can be found in https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#features-of-rnn-functions. Since the changes will likely require API changes in RNN op, I took the liberty to repurpose @sbodenstein 's issue for the API discussion.
Desirable new features would be:
- padding masks for variable-length sequences
- cell clipping
- RNN projection (LSTMP)

Desirable performance improvements would be:
- algorithm search via `findIntensity`

Unclear:
- `_ALGO_PERSIST_*`?

@DickJC123 @ptrendx Also pinging @sxjscience @eric-haibin-lin @sbodenstein @yunzhe-tao @piiswrong @Jerryzcn
Pinging @pengzhao-intel, @Sherry-Zhang @lihaofd @TaoLv
Just saw the message :( The variable-length RNN on the CPU side is scheduled to be implemented in our Q4. @ciyongch
@pengzhao-intel: thanks! That's useful for us to know.
@DickJC123 @ptrendx I think the new features are very useful, since we are working on large-scale NLP models, most of which are based on RNN cells. For example, we need cell state clipping and projection clipping to keep the models from overfitting. Similarly, it would be very helpful to have projection in the RNN cell; that way we can reduce the memory usage and significantly increase the model capacity. So, looking forward to the new features in cuDNN v7.2 making it into MXNet.
@pengzhao-intel Also looking forward to CPU support! Thanks.
@haojin2 is driving the addition of the options for LSTM projection and rnn state clipping in #12541.
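For reference, once #12541 lands, usage might look roughly like the sketch below (parameter names such as `projection_size` and `state_clip_min`/`state_clip_max` follow that proposal and could change; projection and state clipping need the cuDNN path, i.e. a GPU context):

```python
import mxnet as mx
from mxnet import gluon

# Sketch: an LSTM with recurrent projection (LSTMP) and cell-state clipping.
lstm = gluon.rnn.LSTM(hidden_size=1024, num_layers=2,
                      projection_size=256,     # project h_t from 1024 down to 256 dims
                      state_clip_min=-50.0,    # clip the cell state into [-50, 50]
                      state_clip_max=50.0)
lstm.initialize(ctx=mx.gpu(0))

x = mx.nd.random.uniform(shape=(35, 8, 512), ctx=mx.gpu(0))  # (seq_len, batch, input_dim)
out = lstm(x)   # (35, 8, 256): the projection size, not the hidden size
```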
@szha, @DickJC123: TensorCore support for RNNs is currently disabled. We would really like to make use of this for training large RNN models.
Are there any plans to add support for it soon?
Also: what about a flag `cudnn_use_tensorcores` (similar to the cudnn flags in convolution)? When `true`, it will use TensorCores even with `float32` (and handle internal downcasting, or use the new `cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION)`). If it's `None`, it will use the standard heuristics (environment variable + check for `float16`).
I'm not sure about adding the flag as part of the operator arguments, because this option is not meaningful to other implementations. I think instead we can have it as an environment variable, similar to other performance-related features.
We definitely want to add this support, though we are currently a bit short-handed. A patch is most certainly welcome.
@szha: I'm happy to make a PR. One question about the behaviour: the flag `MXNET_CUDA_ALLOW_TENSOR_CORE` is set to `true` by default. Its behaviour is:

- If `DType` is `float16` and `MXNET_CUDA_ALLOW_TENSOR_CORE` is `true`, TensorCores are used. Presumably, the use of TensorCores will never worsen `float16` training, so defaulting to TensorCore use seems reasonable.
- If `DType` is `float32`, then we can't assume the user wants to use TensorCores. But they do want a way of opting in (and setting `CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION` when available). So how about an environment variable `MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION`, which defaults to `false`, and if `true`, will let `float32` or `float64` nets use TensorCores via implicit downcasting?
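A rough sketch of the behaviour I have in mind (the second variable is only the proposal above, and the real check would of course live in the C++ backend, not Python):

```python
import os

def rnn_uses_tensor_cores(dtype):
    """Sketch of the proposed opt-in logic; not the actual MXNet implementation."""
    allow_tc = os.environ.get("MXNET_CUDA_ALLOW_TENSOR_CORE", "1") == "1"
    allow_conversion = os.environ.get(
        "MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION", "0") == "1"  # proposed, default false
    if dtype == "float16":
        # TensorCores should never hurt fp16 training, so follow the existing flag.
        return allow_tc
    if dtype in ("float32", "float64"):
        # Opt-in only: implicit downcasting must be explicitly requested.
        return allow_tc and allow_conversion
    return False
```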
Yes, I agree that the new variable is necessary. This matches the "_ALLOW_CONVERSION" flags in the cudnn interface.
We want to add support for variable-length sequences to the cuDNN RNN operator, as we cannot 'fake' support via masking for LSTM (the cuDNN operator does not return a history of cell states) and bidirectional RNNs. Also, there should be some speedups for batches of sequences of very different lengths (can avoid bucketing).
The annoying part is that cuDNN requires that the sequences be sorted by sequence length and packed into a contiguous block of memory. PyTorch deals with this by introducing a `pack_padded_sequence` operator, which returns a `PackedSequence` object that the user can feed to the cuDNN RNN. MXNet probably (?) doesn't want to take this route, and should instead hide these details from the user (other backends, e.g. the upcoming MKL implementation of `symbol.rnn`, might not require this anyway) and just pack/unpack inside `symbol.RNN`. So some questions: @sxjscience, @piiswrong, @szha
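For reference, the PyTorch route described above looks roughly like this (minimal sketch; in the PyTorch versions current at the time, the lengths must be pre-sorted in decreasing order):

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

batch, max_len, input_dim, hidden = 4, 7, 16, 32
lengths = [7, 5, 3, 2]                       # sorted in decreasing order
x = torch.randn(max_len, batch, input_dim)   # padded batch, (seq_len, batch, input_dim)

packed = pack_padded_sequence(x, lengths)    # packs into a contiguous PackedSequence
lstm = nn.LSTM(input_dim, hidden)
packed_out, (h_n, c_n) = lstm(packed)        # cuDNN consumes the packed layout directly
out, out_lengths = pad_packed_sequence(packed_out)  # back to a padded (max_len, batch, hidden) tensor
```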