As discussed in #114, the standalone AMD and NVIDIA inference passes for the RNN benchmark use RNNForwardTraining, which is only required when training will also be run. This change modifies the AMD and NVIDIA RNN benchmarks to use RNNForwardInference instead, which the cuDNN and MIOpen documentation indicate is the appropriate (and sufficient) call for inference-only passes.
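For reference, the NVIDIA-side change amounts to swapping one call for another against the legacy cuDNN RNN API. This is just an illustrative sketch (variable names are hypothetical, not the benchmark's actual identifiers); the inference entry point takes the same arguments as the training one minus the reserve-space buffer, which is what lets the library skip storing intermediates. The MIOpen change is analogous with miopenRNNForwardInference.

```c
/* Before: training call, which requires a reserve-space buffer to
 * hold intermediate activations for a backward pass that an
 * inference-only benchmark never runs. */
cudnnRNNForwardTraining(handle, rnnDesc, seqLength,
                        xDescs, x, hxDesc, hx, cxDesc, cx, wDesc, w,
                        yDescs, y, hyDesc, hy, cyDesc, cy,
                        workspace, workspaceSize,
                        reserveSpace, reserveSpaceSize);

/* After: inference call, identical except the trailing reserve-space
 * arguments are dropped. */
cudnnRNNForwardInference(handle, rnnDesc, seqLength,
                         xDescs, x, hxDesc, hx, cxDesc, cx, wDesc, w,
                         yDescs, y, hyDesc, hy, cyDesc, cy,
                         workspace, workspaceSize);
```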
I tested these changes locally, and the benchmarks pass. They also usually see small speedups (5% or less for the ones I spot-checked) from avoiding storing the intermediate data required for training.
For the AMD code, I needed to add rocBLAS to the Makefile paths to get gemm to compile when I was re-making everything. I suspect this is only required if rocBLAS is installed in a non-standard location (i.e., somewhere other than /opt/rocm), but I don't have a way to test that, so I included it as a separate commit in case others run into the same problem. I can break it out into a separate pull request if people prefer.
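Concretely, the Makefile addition looks something like the following sketch; the `ROCBLAS_PATH` variable name and the default location are my assumptions, not something the project already defines:

```make
# Hypothetical Makefile fragment: point the build at rocBLAS so gemm
# compiles even when rocBLAS lives outside the default search paths.
ROCBLAS_PATH ?= /opt/rocm/rocblas
CXXFLAGS += -I$(ROCBLAS_PATH)/include
LDFLAGS  += -L$(ROCBLAS_PATH)/lib -lrocblas
```

With the `?=` default, anyone with a standard install is unaffected, while users with a custom location can override `ROCBLAS_PATH` on the make command line.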