In samples/rainbow/05_dqn_prio_replay.py, the weights are propagated to batch_weights_v and multiplied by (state_action_values - expected_state_action_values) ** 2 to calculate losses_v.
(losses_v + 1e-5) is then used as the new priorities, from which the sampling probabilities are calculated.
However, according to https://arxiv.org/pdf/1511.05952.pdf (the Prioritized Experience Replay paper, see Algorithm 1), the TD-error is used as the priority before it is multiplied by the importance-sampling weight.
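For comparison, here is a minimal sketch of how I read Algorithm 1 (the helper name is hypothetical; the tensor names follow the sample): the loss keeps the importance-sampling weights, while the priorities come from the unweighted TD-error.

```python
import torch

def calc_loss_and_priorities(state_action_values, expected_state_action_values,
                             batch_weights_v, eps=1e-5):
    # Unweighted TD-error per sample
    td_errors_v = state_action_values - expected_state_action_values
    # Loss: importance-sampling weight times squared TD-error (as in the sample)
    losses_v = batch_weights_v * td_errors_v ** 2
    # Priorities: |TD-error| + eps, taken BEFORE applying the weights
    new_priorities = (td_errors_v.abs() + eps).detach().cpu().numpy()
    return losses_v.mean(), new_priorities
```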
Is this a mistake?