IceFlowerLi opened 4 years ago
Can you share the Seq2seq code you are using ?
RNNs only use MIOpen/cuDNN if you are using the RNN, LSTM and GRU layers. The RNNCell, LSTMCell and GRUCell layers all fall back to generic CUDA code.
Additionally, the MIOpen-optimized layers do not support dropout, which means they fall back to the generic CUDA code if it is enabled.
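The distinction above can be sketched in a few lines (my own minimal example, not code from the issue; the dimensions are arbitrary):

```python
import torch
import torch.nn as nn

# The fused nn.LSTM layer is the one that can dispatch to MIOpen/cuDNN;
# an equivalent per-timestep loop over nn.LSTMCell always runs as generic code.
seq_len, batch, feat, hidden = 5, 3, 8, 16

fused = nn.LSTM(feat, hidden)     # eligible for MIOpen/cuDNN
cell = nn.LSTMCell(feat, hidden)  # always falls back to generic kernels

x = torch.randn(seq_len, batch, feat)

out_fused, _ = fused(x)           # one fused call over the whole sequence

h = torch.zeros(batch, hidden)
c = torch.zeros(batch, hidden)
outs = []
for t in range(seq_len):          # Python loop per timestep -> no MIOpen path
    h, c = cell(x[t], (h, c))
    outs.append(h)
out_cell = torch.stack(outs)

print(out_fused.shape, out_cell.shape)  # both (seq_len, batch, hidden)
```

Both variants compute the same shape of output; only the fused layer can hit the vendor-optimized kernels.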
A quick way to check whether the MIOpen layers are called is to run the script again with the environment variable export MIOPEN_ENABLE_LOGGING_CMD=1 set. It will print the configuration/command used by MIOpen; more information can be found here.
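For example (train.py is a placeholder name for whatever entry point you actually run):

```shell
# Enable MIOpen command logging for this shell; MIOpen then prints one
# MIOpenDriver command line per kernel it executes, so layers that really
# hit MIOpen show up in the output.
export MIOPEN_ENABLE_LOGGING_CMD=1
echo "MIOPEN_ENABLE_LOGGING_CMD=$MIOPEN_ENABLE_LOGGING_CMD"
# python train.py   <- hypothetical script name; substitute your own
```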
Hi, @Delaunay! Thank you for your advice.
The Seq2seq code can be obtained from this repository -> https://github.com/keon/seq2seq
Following your suggestion, I scrutinized the model code. RNNCell, LSTMCell and GRUCell are not used. This seq2seq model does use the Dropout layer inside the RNN structure and after the embedding layer, so I commented out the Dropout layer code and ran the program again. The time cost on the AMD machine is still the 19 minutes mentioned before.
Below are the program log files. The Dropout layer code was commented out and the MIOpen debug environment variables were set. The program exits after one batch of data has been fed and the backward pass has finished, because I inserted exit(1)
at line 60 of model.py.
By dropout I meant: PyTorch supports dropout inside the RNN layers, and it is that particular dropout that is not supported. Explicit Dropout layers should be fine.
torch.nn.RNN(..., dropout=0.5) # <= not supported by MIOpen, should be 0
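Concretely, the safe pattern looks like this (a minimal sketch with arbitrary dimensions: the RNN's built-in dropout stays at 0 so the layer keeps its fast path, and dropout is applied explicitly outside the layer):

```python
import torch
import torch.nn as nn

# Built-in RNN dropout disabled; explicit Dropout module used instead.
rnn = nn.GRU(input_size=32, hidden_size=64, num_layers=2, dropout=0.0)
drop = nn.Dropout(p=0.5)        # explicit Dropout layer is fine

x = torch.randn(10, 4, 32)      # (seq_len, batch, input_size)
out, h = rnn(x)
out = drop(out)                 # dropout applied after the RNN layer
print(out.shape)                # (10, 4, 64)
```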
Oh, I got it. Then I set the dropout inside the RNN to 0, and the model code became the following file. Make sure the dropout value is 0 in all RNN structures. model.txt But the time cost is still 19 minutes and doesn't decrease at all. ?_?
Well, that sucks!
How do you compute the 19 min? Something to keep in mind: the first batch is always orders of magnitude slower than the subsequent ones, because MIOpen compiles its kernels the first time they are used.
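A warm-up-aware timing helper along these lines avoids counting that one-time kernel compilation (my own sketch, not code from the issue):

```python
import time

def time_after_warmup(fn, warmup=1, iters=10):
    """Average runtime of fn, discarding warm-up calls.

    The first call can include one-time costs such as MIOpen kernel
    compilation, so it is excluded from the measurement.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Example with a trivial CPU workload; for GPU code you would also call
# torch.cuda.synchronize() inside fn before the clock is read.
avg = time_after_warmup(lambda: sum(i * i for i in range(10_000)))
print(f"{avg * 1e3:.3f} ms per iteration")
```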
I ran the Seq2seq program on the two machines and stopped it after 6 epochs. It is obvious that MIOpen's compile time at program start is longer than cuDNN's. Of course, the program could be tested with more epochs, other parameters, or even other programs, but I feel this is enough to show the problem. I hope RNN computation gets more optimization in MIOpen. :yum:
/cc @daniellowell @ce1adon @aserio.
@IceFlowerLi which MIOpen version are you using? MIOpen v2.1 added Dropout support.
The cc-ed folks may have additional insights into the performance issue.
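One quick way to see which build is in use (a sketch; on ROCm builds of PyTorch, torch.version.hip is a version string, while on CUDA builds it is None — note this reports the HIP/ROCm version, not the MIOpen version directly):

```python
import torch

# Print the PyTorch version and, if present, the HIP version of a ROCm build.
print("torch:", torch.__version__)
print("hip:", getattr(torch.version, "hip", None))
```

The MIOpen version itself can be checked against the installed ROCm packages on the machine.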
Hi @IceFlowerLi. Do you still need assistance with this ticket? If not, please close the ticket. Thanks!
When I ran the classical Seq2seq model on my machine with this program, I found that a program consisting of RNN structures with the same parameters is far slower than on the NVIDIA platform, even though the Float32 performance of the Radeon 7 is greater than the 1080ti's. Yet for CNN operations the performance holds up: the Radeon 7 is faster than the 1080ti on the official PyTorch MNIST example. Here is the comparison table:
I also ran my own experiments with more complex RNN structures and larger datasets. The program costs 2 hours per epoch on the 1080ti and 9 hours per epoch on the Radeon 7 with the same parameters. Of course, I would be happy to see others run more tests on more complex RNN and CNN models. On the official NVIDIA cuDNN website, we can see that sequence-model performance improved by a large margin between cuDNN versions. Shouldn't our MIOpen library get more optimization for RNN operations?
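For anyone who wants to reproduce the gap, a minimal forward+backward LSTM benchmark along these lines could be run on both machines (my own sketch; the sizes are illustrative, not the ones from the issue):

```python
import time
import torch
import torch.nn as nn

# Stacked LSTM, forward + backward per step; run with a CUDA/ROCm device on
# each machine to compare the cuDNN and MIOpen paths.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.LSTM(input_size=128, hidden_size=256, num_layers=2).to(device)
x = torch.randn(50, 32, 128, device=device)  # (seq_len, batch, input_size)

def step():
    model.zero_grad()
    out, _ = model(x)
    out.sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before timing

step()  # warm-up: triggers one-time kernel compilation
start = time.perf_counter()
for _ in range(5):
    step()
print(f"{(time.perf_counter() - start) / 5 * 1e3:.1f} ms per step")
```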