Fix bug in RED predict_reward.

Before: predict_reward took the squared distance between each [state, action] pair and ALL other pairs in the batch (see the _square_distance(x, y) function in models.py, which is used by the _gaussian_kernel function). The predicted reward is instead supposed to be the squared distance between the outputs of two different networks (Predictor vs. Target) for the same [state, action] input.
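
The difference between the two computations can be illustrated with a minimal NumPy sketch. This is not the code from models.py: the function names, the array shapes, and the random linear maps standing in for the Predictor and Target networks are all assumptions made for illustration.

```python
import numpy as np


def pairwise_square_distance(x, y):
    """Squared Euclidean distance between every row of x and every row of y.

    Returns an (N, M) matrix; this mirrors the all-pairs behaviour described above.
    """
    diff = x[:, None, :] - y[None, :, :]
    return (diff ** 2).sum(axis=-1)


def per_sample_square_distance(pred_out, target_out):
    """Squared Euclidean distance between matching rows only.

    Returns one value per [state, action] pair, shape (N,).
    """
    return ((pred_out - target_out) ** 2).sum(axis=-1)


rng = np.random.default_rng(0)
state_action = rng.normal(size=(4, 3))  # batch of 4 concatenated [state, action] rows

# Hypothetical stand-ins for the two networks: fixed random linear maps.
W_pred = rng.normal(size=(3, 5))
W_target = rng.normal(size=(3, 5))
pred_out = state_action @ W_pred
target_out = state_action @ W_target

# All-pairs distance between inputs: one row per sample, one column per other sample.
old = pairwise_square_distance(state_action, state_action)

# Per-sample distance between the two network outputs: one scalar per sample.
new = per_sample_square_distance(pred_out, target_out)
```

Here `old` has shape (4, 4) and compares inputs with each other, while `new` has shape (4,) and compares the two network outputs for each sample, which is the quantity the message above describes.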