LxMLS / lxmls-toolkit

Machine Learning applied to Natural Language Processing Toolkit used in the Lisbon Machine Learning Summer School

Fix in RL exercises: Q-Learning and Value Iteration #171

Closed — q0o0p closed this 3 years ago

q0o0p commented 4 years ago

I should have fixed it a year ago.

I had already added the same fix in exercise_1_3_solutions.ipynb (master branch) quite a while ago, in commit cef22423.

However, at that time I forgot to make the same update in exercises_1_4.ipynb (both the master and student branches contain this file).

So, let's add the same small but important fix here (I'm going to apply it to the master and student branches simultaneously).

Also, let me copy the old commit message (cef22423) here for reference:

In both Policy Optimization exercises, use the rewards obtained after the transition to a state (the raw_rewards variable) instead of the expected value of the next reward under the original policy (the rewards variable). In the Value Iteration exercise, also index the reward by the next state (the s_next variable) instead of the source state (the s variable). Judging by the s_prime variable name, the original code may have been an attempt to perform Value Iteration in a "backward" direction, but I have not managed to make such code produce correct results (and have not seen any algorithm that works this way), so I implemented the classic version (as in the Sutton book).

Now all three optimal state values are equal. This is correct because, in our particular MRP example, the reward depends only on the next state, not on the source state, so the initial state does not affect the reward obtained under the optimal policy.
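For concreteness, here is a minimal sketch of the two updates described above, assuming a transition tensor P[a, s, s_next] and a reward vector indexed by the next state (analogous to the raw_rewards variable). The function names and signatures are illustrative only, not the toolkit's actual code:

```python
import numpy as np


def value_iteration(P, raw_rewards, gamma=0.9, tol=1e-8):
    """Classic Value Iteration (as in the Sutton & Barto book).

    P           -- transition probabilities, shape (n_actions, n_states, n_states),
                   where P[a, s, s_next] = Pr(s_next | s, a)
    raw_rewards -- reward received on arriving in a state, shape (n_states,);
                   in the exercise's MRP the reward depends only on the next state.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_{s_next} P[a, s, s_next] * (raw_rewards[s_next] + gamma * V[s_next])
        Q = P @ (raw_rewards + gamma * V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)  # optimal state values and greedy policy
        V = V_new


def q_learning_step(Q, s, a, s_next, raw_rewards, alpha=0.1, gamma=0.9):
    """One tabular Q-Learning update that uses the reward actually obtained
    after the transition (raw_rewards[s_next]) rather than the expected next
    reward under the original policy."""
    td_target = raw_rewards[s_next] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```

Indexing the reward by s_next in both updates is exactly the change the commit makes; since the reward does not depend on the source state, the optimal values of all states coincide, as noted above.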


Counterpart pull request for master branch: #172

chrishokamp commented 3 years ago

LGTM