My solution produces results with a similar shape to the book's, but a different start-state value under the greedy policy. I am not sure what goes wrong; possibly the reward calculation. However, my results are similar to all the other results I found online (see the reference implementations below). So treat my solution as one reference among several, not as definitively correct.
Other reference implementations: