LxMLS / lxmls-toolkit

Machine Learning applied to Natural Language Processing Toolkit used in the Lisbon Machine Learning Summer School

Clarification of 'reward' variable meaning in RL lab #130

Closed. q0o0p closed this issue 5 years ago.

q0o0p commented 5 years ago

The guide says in section 6.4 that the rewards [10., 2., 3.] are the rewards for a transition into state 1, 2, and 3, respectively. But that definition of the 'rewards' variable is not consistent with the solution code, even though the solution code itself is correct. So here we change the definition of the 'rewards' variable to make it consistent with the code. The new definition is: the 'rewards' variable contains the expected value of the next reward for each state.

Naturally, for clarity we also introduce a new variable, 'raw_rewards': it contains the rewards obtained after a transition into each state. As for the 'rewards' variable, we now generate it from 'raw_rewards' and 'policy' instead of assigning it directly (see the sketch below).

For more details see the corresponding pull request in the 'lxmls-guide' project: LxMLS/lxmls-guide#124
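A minimal sketch of what this re-definition might look like. The transition matrix P below is hypothetical and only stands in for whatever the lab's 'policy' actually induces; the array shapes and probabilities in the real code may differ:

```python
import numpy as np

# 'raw_rewards[s_next]' is the reward obtained after a transition INTO state
# s_next (the [10., 2., 3.] values from section 6.4 of the guide).
raw_rewards = np.array([10., 2., 3.])

# Hypothetical transition matrix under the current policy:
# P[s, s_next] = probability of moving from state s into state s_next.
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

# 'rewards[s]' is then the EXPECTED next reward when starting in state s and
# following the policy: a probability-weighted average of raw_rewards.
rewards = P @ raw_rewards
print(rewards)  # [2.5, 6.5, 6.0] for this illustrative matrix
```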


Minor changes in RL exercise Value Iteration

Change list:

The remaining changes are refactoring:


Fix Policy Optimization RL exercises: Q-Learning and Value Iteration

Use the rewards obtained after transitioning into a state (the 'raw_rewards' variable) instead of the expected value of the next reward under the original policy (the 'rewards' variable) in both Policy Optimization exercises. In the Value Iteration exercise, also use the next state (the 's_next' variable) instead of the source state (the 's' variable) for indexing the reward. Judging by the 's_prime' variable name in the original code, it may have been an attempt to perform Value Iteration in a "backward direction", but I haven't managed to make such code produce correct results (and haven't seen any algorithm working that way), so I just implemented the classic one, as in the Sutton book; see the sketch below.
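For reference, here is a minimal sketch of the classic value iteration update described above (Sutton & Barto style), with the reward indexed by the next state. The transition tensor P[a, s, s_next] and the discount gamma are assumptions for illustration, not the lab's actual data structures:

```python
import numpy as np

def value_iteration(P, raw_rewards, gamma=0.9, tol=1e-8):
    """Classic value iteration.

    P[a, s, s_next]     -- assumed transition tensor (one matrix per action)
    raw_rewards[s_next] -- reward obtained after transitioning INTO s_next,
                           i.e. the reward is indexed by the next state.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum over s_next of
        #           P[a, s, s_next] * (raw_rewards[s_next] + gamma * V[s_next])
        Q = np.einsum('asn,n->as', P, raw_rewards + gamma * V)
        V_new = Q.max(axis=0)  # greedy maximisation over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```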

Now all 3 optimal state values are equal. This is correct because in our MRP example the reward depends only on the next state, not on the source state, so the initial state does not affect the reward under the optimal policy.
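As a toy illustration of why that happens, consider a hypothetical MDP in which action a moves deterministically into state a from any source state (this construction only mirrors the property described above; it is not the lab's exact environment). The greedy backup raw_rewards[a] + gamma * V[a] is then identical for every source state, so value iteration converges to the same value everywhere:

```python
import numpy as np

# Toy MDP: from ANY state, action a jumps deterministically into state a,
# and the reward depends only on the state entered (hypothetical setup).
raw_rewards = np.array([10., 2., 3.])
gamma = 0.9
n_states = len(raw_rewards)

V = np.zeros(n_states)
for _ in range(1000):
    # Q[s, a] = raw_rewards[a] + gamma * V[a]: independent of the source state s.
    Q = np.tile(raw_rewards + gamma * V, (n_states, 1))
    V = Q.max(axis=1)

print(V)  # all entries converge to the same value: 10 / (1 - 0.9) = 100
```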