I just found that the plot generated by the original code doesn't match Figure 6.7 in Sutton's book.
This is Figure 6.7 in Sutton's book:
However, this is the plot generated by the original code:
I think the problem is in lines 116 and 117. I changed them to the following code:
left_counts_q = left_counts_q.mean(axis=0)
left_counts_double_q = left_counts_double_q.mean(axis=0)
This is the new plot generated by the revised code:
This output resembles Figure 6.7 in Sutton's book.
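For anyone wondering why the mean over axis 0 matters, here is a minimal sketch. It assumes the count arrays have shape (runs, episodes), which I have not checked against every version of the original code; the sizes and the random data below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the original script.
runs, episodes = 1000, 300

# Hypothetical stand-in data: 1 if "left" was chosen from state A
# in that episode of that run, 0 otherwise.
left_counts_q = rng.integers(0, 2, size=(runs, episodes))

# Averaging over axis 0 (the runs) leaves one value per episode:
# the fraction of runs in which "left" was taken in that episode.
left_fraction = left_counts_q.mean(axis=0)

print(left_fraction.shape)  # one entry per episode
```

With this layout the result is a per-episode fraction between 0 and 1, which I believe is the quantity on the y-axis of Figure 6.7.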
Another suggestion (the following code replaces the original code on line 93):
best_action = np.random.choice([action_ for action_, value_ in enumerate(active_q[next_state]) if value_ == np.max(active_q[next_state])])
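The point of this second suggestion is that `np.argmax` always returns the first index when several actions share the maximum value, so ties are never broken randomly. Below is a small standalone sketch of the same idea; the helper name `argmax_random_tie` is my own and is not part of the original code.

```python
import numpy as np

def argmax_random_tie(values, rng=None):
    """Return the index of a maximal entry, chosen uniformly among ties."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values)
    # Indices of every entry equal to the maximum value.
    candidates = np.flatnonzero(values == values.max())
    return int(rng.choice(candidates))

# With all-zero values every action is tied, so any index may be returned.
rng = np.random.default_rng(0)
picks = {argmax_random_tie([0.0, 0.0, 0.0], rng) for _ in range(200)}
print(picks)
```

Computing the maximum once and using `np.flatnonzero` also avoids calling `np.max` for every element, as the list comprehension above does.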