ROCm / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Different performance between RNN and CNN #546

Open IceFlowerLi opened 4 years ago

IceFlowerLi commented 4 years ago

When I run the classical Seq2seq model on my machine with this program, I found that a program built from RNN structures with the same parameters is far slower than on the NVIDIA platform, even though the Float32 performance of the Radeon 7 is greater than that of the 1080ti. But for CNN operations the performance is the reverse: the Radeon 7 is faster than the 1080ti on the official PyTorch MNIST example. Here is the comparison table:

|                        | Radeon 7   | 1080ti     |
| ---------------------- | ---------- | ---------- |
| Epochs                 | 6          | 6          |
| Compute platform       | ROCm 2.9   | CUDA 10.0  |
| Deep learning library  | MIOpen 2.1 | cuDNN 7.5  |
| PyTorch                | 1.3.1      | 1.3.1      |
| Seq2seq time           | 19 min     | 10 min     |
| MNIST time             | 2 min 39 s | 3 min 59 s |

I also tried my own experiments with a more complex RNN structure and a larger dataset. With the same parameters, the program takes 2 hours per epoch on the 1080ti and 9 hours per epoch on the Radeon 7. Of course, I would be happy to see others run more tests on more complex RNN and CNN models. On the official NVIDIA cuDNN website, we can see that sequence-model performance improved considerably across different versions of cuDNN. Does our MIOpen library need more optimization for RNN operations?

Delaunay commented 4 years ago

Can you share the Seq2seq code you are using?

RNNs only use MIOpen/cuDNN if you are using the RNN, LSTM, and GRU layers. The RNNCell, LSTMCell, and GRUCell layers all fall back to generic CUDA code.

Additionally, MIOpen-optimized layers do not support dropout, which means they will fall back to the generic CUDA code if dropout is enabled.
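
For illustration, a minimal sketch of the distinction (layer sizes and shapes are generic, not taken from the issue's model):

```python
import torch
import torch.nn as nn

x = torch.randn(35, 64, 256, device="cuda")  # (seq_len, batch, input_size)

# nn.RNN / nn.LSTM / nn.GRU can dispatch to the fused MIOpen (or cuDNN) kernels.
lstm = nn.LSTM(input_size=256, hidden_size=512, num_layers=2).cuda()
out, _ = lstm(x)

# nn.LSTMCell is stepped manually from Python, so it always runs the generic
# per-timestep implementation and never hits the MIOpen/cuDNN RNN path.
cell = nn.LSTMCell(input_size=256, hidden_size=512).cuda()
h = torch.zeros(64, 512, device="cuda")
c = torch.zeros(64, 512, device="cuda")
for t in range(x.size(0)):
    h, c = cell(x[t], (h, c))
```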

Delaunay commented 4 years ago

A quick way to check whether the MIOpen layers are called is to execute the script again with the environment variable export MIOPEN_ENABLE_LOGGING_CMD=1.

It will print the configuration/command that is used by MIOpen; more information can be found here.
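
As a sketch, the variable can also be set from inside the script rather than the shell, assuming MIOpen picks it up from the process environment at run time:

```python
# Same effect as `export MIOPEN_ENABLE_LOGGING_CMD=1` in the shell;
# set it before torch is imported and before any MIOpen kernel is launched.
import os
os.environ["MIOPEN_ENABLE_LOGGING_CMD"] = "1"

import torch
import torch.nn as nn

# Running any RNN/LSTM/GRU forward pass should now log the MIOpen command used,
# which confirms whether the MIOpen path is actually being taken.
gru = nn.GRU(input_size=128, hidden_size=256).cuda()
out, _ = gru(torch.randn(10, 8, 128, device="cuda"))
```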

IceFlowerLi commented 4 years ago

Hi, @Delaunay! Thank you for your advice. The Seq2seq code can be obtained from this repository -> https://github.com/keon/seq2seq According to your suggestion, I scrutinized the model code. RNNCell, LSTMCell, and GRUCell are not used. Indeed, this seq2seq model uses Dropout in the RNN structure and after the embedding layer. So I commented out the Dropout layer code and ran the program again. The time cost on the AMD machine is still the 19 minutes mentioned before. The program log files are attached below. The Dropout layer code was commented out and the MIOpen debug environment variable was set. The program exits after one batch of data is fed and the backward pass is finished, by inserting exit(1) at line 60 of model.py.

log_amd.txt model.txt

Delaunay commented 4 years ago

By dropout I meant: PyTorch supports dropout inside the RNN layers, and that particular dropout is not supported. Explicit Dropout layers should be fine.

```python
torch.nn.RNN(..., dropout=0.5)  # <= not supported by MIOpen, should be 0
```
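
For illustration, a sketch of the pattern that keeps the MIOpen path usable (sizes and names are generic, not from the seq2seq repo):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.drop = nn.Dropout(0.5)   # explicit Dropout module: fine with MIOpen
        # RNN-internal dropout kept at 0 so the fused MIOpen kernel can be used.
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=2, dropout=0.0)

    def forward(self, src):
        emb = self.drop(self.embed(src))
        out, hidden = self.gru(emb)
        return out, hidden
```
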
IceFlowerLi commented 4 years ago

Oh, I got it. I set the dropout inside the RNN to 0, and the model code turned into the following file; the dropout value is now 0 in all RNN structures. model.txt But the time cost is still 19 minutes and doesn't decrease at all. ?_?

Delaunay commented 4 years ago

Well, that sucks!

How do you measure the 19 min? Something to keep in mind is that the first batch is always magnitudes slower than the subsequent ones, because MIOpen compiles its kernels the first time they are used.
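
A rough way to see this effect, as an illustrative sketch (model and batch shapes are placeholders, not from the issue): time the first batch separately from the steady-state batches, synchronizing around each measurement.

```python
import time
import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=256, hidden_size=512, num_layers=2).cuda()
x = torch.randn(35, 64, 256, device="cuda")

def timed_step():
    torch.cuda.synchronize()
    start = time.time()
    out, _ = rnn(x)
    out.sum().backward()
    torch.cuda.synchronize()
    return time.time() - start

first = timed_step()                       # includes MIOpen kernel compilation
rest = [timed_step() for _ in range(10)]   # steady-state iterations
print(f"first: {first:.3f}s  steady state: {sum(rest) / len(rest):.3f}s")
```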

IceFlowerLi commented 4 years ago

I ran the Seq2seq program on both machines and stopped it after 6 epochs. It is obvious that the compile time of MIOpen at program start is longer than that of cuDNN. Of course, the program could be tested with more epochs, other parameters, or even other programs, but I feel this is enough to show the problem. I hope RNN computation can be given more optimization in MIOpen. :yum:

dagamayank commented 4 years ago

/cc @daniellowell @ce1adon @aserio.

@IceFlowerLi which MIOpen version are you using? MIOpen v2.1 added Dropout support.

The cc-ed folks may have additional insights into the performance issue here.

Delaunay commented 4 years ago

In that case it is not the same. cuDNN allows dropout to be embedded in the RNN layer itself, i.e. the cuDNN RNN descriptor has a DropoutDescriptor_t inside, but MIOpen does not.

That is why the dropout inside the RNN layers should be 0. The check is done here on the PyTorch side.