In general policy iteration, in the policy improvement step we are looking for all actions that satisfy
r_i(a) + \alpha \sum_j p_{ij}(a) v_j - v_i > 0,
where r_i(a) and p_{ij}(a) are the reward and transition law given action a, and v_i is the mean reward for state i obtained in the policy evaluation step. We can do this locally for every state, independently of the other states. For more details, see for example Algorithm 3.1 in these lecture notes on Markov Decision Processes.
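Concretely, the improvement step could look like the following minimal sketch (the array names r, P, v and alpha are assumptions for illustration, not variables from our implementation):

```python
import numpy as np

def policy_improvement(r, P, v, alpha):
    """One policy-improvement sweep for a finite MDP (illustrative sketch).

    r[i, a]     -- reward for taking action a in state i
    P[a, i, j]  -- transition probability from i to j under action a
    v[i]        -- value of state i from the policy-evaluation step
    alpha       -- discount factor
    For every state we look for actions whose one-step look-ahead
    improves on the current value, i.e.
    r_i(a) + alpha * sum_j p_ij(a) v_j - v_i > 0.
    """
    num_actions, num_states, _ = P.shape
    new_policy = np.zeros(num_states, dtype=int)
    for i in range(num_states):
        # one-step look-ahead value of every action in state i
        q = r[i] + alpha * P[:, i, :] @ v
        new_policy[i] = np.argmax(q)  # greedy choice; any a with q[a] > v[i] is an improvement
    return new_policy
```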
Now, assume that for a given state i we have only a single action: deciding whether to activate the fragile link (i, j), and there are no other outgoing edges. Also note that in our case the rewards are action-independent, i.e. r_i(a) = r_i. The above condition can now be rewritten as
r_i + \alpha v_j - v_i > 0,
which is equivalent to v_j - \frac{v_i - r_i}{\alpha} > 0.
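To make the rearrangement explicit (using only that the discount factor \alpha is positive):
r_i + \alpha v_j - v_i > 0 \;\Longleftrightarrow\; \alpha v_j > v_i - r_i \;\Longleftrightarrow\; v_j > \frac{v_i - r_i}{\alpha}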
Going back to the general case of f_i fragile edges for each node i: you'll notice that we are checking each fragile edge separately rather than checking all 2^{f_i} possible combinations of actions (i.e. all edges deactivated, first edge activated and the others deactivated, ..., all edges activated). We are allowed to do this since, as we show later in the paper, we can rewrite the MDP into an equivalent MDP such that each state has at most two actions (on or off). Basically, we are doing policy iteration on the auxiliary graph, but because of the special structure of the auxiliary states we don't have to explicitly include them. Something similar was also done in previous work: see the PageRank Iteration algorithm in PageRank Optimization by Edge Selection. Ours is essentially a generalized version of that.
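To illustrate why the per-edge check suffices, here is a rough sketch of what the improvement step amounts to under this equivalence (the names fragile_edges, r, v and alpha are hypothetical, not the actual variables in our code): each fragile edge is switched on or off independently of the others.

```python
def improve_fragile_policy(fragile_edges, r, v, alpha):
    """Decide each fragile edge independently (illustrative sketch).

    fragile_edges -- dict mapping node i to its list of fragile targets j
                     (hypothetical structure, for illustration only)
    r[i], v[i]    -- reward and current value of node i
    alpha         -- discount factor
    Edge (i, j) is activated iff v_j > (v_i - r_i) / alpha,
    i.e. iff turning it on is an improving action.
    """
    new_policy = {}
    for i, targets in fragile_edges.items():
        threshold = (v[i] - r[i]) / alpha
        # no need to enumerate all 2^{f_i} subsets: each edge is checked on its own
        new_policy[i] = {j: v[j] > threshold for j in targets}
    return new_policy
```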
You can also look into Theorem 5 of Ergodic Control and Polyhedral Approaches to PageRank Optimization. They arrive at a similar conclusion, namely that "any optimal link strategy must choose for every controlled page i all the facultative links (i, j) such that v_j > \frac{v_i - r_i}{\alpha}".
Hope this helps.
Dear author, I am really struggling to understand Algorithm 1. Why is the policy improvement calculated in that way (line 4 of Algorithm 1)? Could you explain more about it? Thanks very much!