I was checking your answer for exercise 3.29, and I think it might contain a mistake. The final equation averages over all actions under the policy, whereas I think it should take the maximum over all actions, which would also remove the policy term.
I believe it is a mistake because the backup diagram for q* (page 64) shows a maximum rather than an average.
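To make the suggestion concrete, here is the form I would expect for the last equation. This is only a sketch: it assumes the exercise wants the equations written with the three-argument p(s'|s,a) and the two-argument r(s,a), and the symbols are my reading of the book's notation rather than a copy of your answer.

```latex
% Averaged form (what I would expect for q_pi, not q_*):
q_\pi(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a')

% Optimality form (max over next actions, no policy term):
q_*(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \max_{a'} q_*(s',a')
```

The only difference between the two is that the policy-weighted sum over a' is replaced by a maximum over a'.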
Looking forward to hearing from you!