Section 6.4 of the guide says that the rewards [10., 2., 3.] are the rewards for a transition into states 1, 2, and 3, respectively.
But this definition of the 'rewards' variable is not consistent with the solution code, even though the solution code itself is correct.
So here we change the definition of the 'rewards' variable to make it consistent with the code.
The new definition is:
The 'rewards' variable contains the expected value of the next reward for each state.
For clarity we also introduce a new variable, 'raw_rewards':
The 'raw_rewards' variable contains the reward obtained after a transition into each state.
The 'rewards' variable is now generated from 'raw_rewards' and 'policy' (as sketched below) instead of being assigned directly.
For more details see the corresponding pull request in the 'lxmls-guide' project: LxMLS/lxmls-guide#124
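As a rough illustration of that generation step (a sketch under assumptions: the transition probabilities below are placeholders and the actual layout of 'policy' in the notebook may differ; only raw_rewards is the [10., 2., 3.] vector from section 6.4):

```python
import numpy as np

# Assumed: 'policy' is a row-stochastic 3x3 matrix where policy[s, s_next]
# is the probability of moving from state s to state s_next.
raw_rewards = np.array([10., 2., 3.])    # reward received on entering each state
policy = np.array([[0.0, 0.5, 0.5],
                   [0.5, 0.0, 0.5],
                   [0.5, 0.5, 0.0]])     # placeholder transition probabilities

# rewards[s] = sum over s_next of policy[s, s_next] * raw_rewards[s_next],
# i.e. the expected value of the next reward when starting from state s.
rewards = policy.dot(raw_rewards)
```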
Minor changes in RL exercise Value Iteration
Change list:
Use a floating-point dtype for the 'state_value_function' array instead of an integer one: if a numpy array is initialized from an integer list, every value later assigned to its elements is cast to integer (see the short demonstration after this list).
The remaining changes are refactoring:
Use a 'gamma' variable for the discount factor instead of the magic constant 0.1
Remove the unused copy 's_v_f'; it is not used in the solution (nor in the algorithm in the Sutton book, p. 83, 2nd edition)
Create the q-values array explicitly, for two reasons: (1) clarity and (2) avoiding overly long lines, since students will be working on small-screen laptops
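A short demonstration of the dtype pitfall from the first item above; the array values are illustrative only:

```python
import numpy as np

# Integer dtype is inferred from the integer list, so float assignments are truncated.
state_value_function = np.array([0, 0, 0])
state_value_function[0] = 0.9
print(state_value_function)        # [0 0 0] -- the 0.9 was silently cast to 0

# Initializing from a float list (or with np.zeros) gives a float64 array.
state_value_function = np.array([0., 0., 0.])
state_value_function[0] = 0.9
print(state_value_function)        # [0.9 0.  0. ]
```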
Fix Policy Optimization RL exercises: Q-Learning and Value Iteration
In both Policy Optimization exercises, use the rewards obtained after a transition into a state (the 'raw_rewards' variable) instead of the expected values of the next reward under the original policy (the 'rewards' variable).
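For Q-Learning, the fix boils down to using raw_rewards[s_next] in the TD target. A minimal tabular sketch, where alpha, gamma, and the array shapes are illustrative assumptions rather than the exercise's exact values:

```python
import numpy as np

def q_learning_update(q_values, s, a, s_next, raw_rewards, alpha=0.5, gamma=0.1):
    # Reward for landing in s_next, i.e. raw_rewards[s_next],
    # rather than the policy-averaged rewards[s] used before the fix.
    td_target = raw_rewards[s_next] + gamma * q_values[s_next].max()
    q_values[s, a] += alpha * (td_target - q_values[s, a])
    return q_values

# Example usage on a toy 3-state, 2-action table:
q = np.zeros((3, 2))
q = q_learning_update(q, s=0, a=1, s_next=2, raw_rewards=np.array([10., 2., 3.]))
```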
In the Value Iteration exercise, also index the reward by the next state (the 's_next' variable) instead of the source state (the 's' variable). Judging by the 's_prime' variable name in the original code, it may have been an attempt to perform Value Iteration in a "backward direction", but I could not get such code to produce correct results (and have not seen any algorithm that works this way), so I implemented the classic version (as in the Sutton book).
Now all three optimal state values are equal. This is correct, because in our MRP example the reward depends only on the next state, not on the source state, so the initial state does not affect the reward under the optimal policy.
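For reference, a self-contained sketch of the classic Value Iteration backup described above. The transition tensor is a uniform placeholder (an assumption, not the exercise's actual dynamics); only raw_rewards and gamma mirror the example:

```python
import numpy as np

raw_rewards = np.array([10., 2., 3.])           # reward for entering each state
gamma = 0.1                                     # discount factor
n_states, n_actions = 3, 2
# transition[a, s, s_next]: probability of reaching s_next by taking action a in s
transition = np.full((n_actions, n_states, n_states), 1.0 / n_states)

state_value_function = np.zeros(n_states)       # float array (see the dtype note above)
for _ in range(100):
    q_values = np.zeros((n_states, n_actions))  # explicit q-values array
    for s in range(n_states):
        for a in range(n_actions):
            for s_next in range(n_states):
                # reward indexed by the NEXT state, as in the fix described above
                q_values[s, a] += transition[a, s, s_next] * (
                    raw_rewards[s_next] + gamma * state_value_function[s_next])
    state_value_function = q_values.max(axis=1)  # greedy backup
print(state_value_function)                      # all three values come out equal here
```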