I think the wording of
"...use the upper confidence bound in equation (9.1) to compute the optimal action for each state with an exploration parameter ..."
might be a bit misleading. My understanding is that the exercise asks us to use the exploration parameter and the tables to select, at each state, the action to traverse/explore next during Monte Carlo tree search (MCTS). A slight rewording to remove "optimal" would help clarify this. As worded, it could be read as saying that the search is already complete and we need to pick the action that maximizes our estimate of Q, in which case the exploration parameter would not be used at all.
Example 9.3 uses the following phrase when discussing a similar step:
The second simulation begins by selecting the best action from the initial state according to our exploration strategy in equation (9.1).
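To make the two readings concrete, here is a minimal sketch. It assumes equation (9.1) has the usual UCB1 form, Q(s,a) + c*sqrt(log N(s) / N(s,a)), with N(s) taken as the sum of the action counts at s. The function names and the table values are made up for illustration and are not taken from the exercise.

```python
import math

def ucb1_action(Q, N, s, actions, c):
    """Action to explore next during search: maximizes the UCB1 score
    Q(s,a) + c*sqrt(log N(s) / N(s,a)) (assumed form of equation 9.1)."""
    Ns = sum(N[(s, a)] for a in actions)  # total visits to state s

    def score(a):
        n = N[(s, a)]
        if n == 0:
            return math.inf  # untried actions are explored first
        return Q[(s, a)] + c * math.sqrt(math.log(Ns) / n)

    return max(actions, key=score)

def greedy_action(Q, s, actions):
    """Action to return once search is finished: maximizes Q(s,a) only."""
    return max(actions, key=lambda a: Q[(s, a)])

# Hypothetical tables for a single state, for illustration only.
Q = {("s1", "a1"): 10.0, ("s1", "a2"): 8.0}
N = {("s1", "a1"): 20, ("s1", "a2"): 2}
actions = ["a1", "a2"]

greedy = greedy_action(Q, "s1", actions)            # ignores the exploration parameter
explore = ucb1_action(Q, N, "s1", actions, c=5.0)   # favors the rarely tried action
```

With these numbers the two readings disagree: the greedy choice is a1, while the UCB1 choice with c = 5 is a2, because a2 has been tried only twice and gets a large exploration bonus. This is why "optimal" reads ambiguously in the exercise text.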