To reuse standard reinforcement learning code, we want to vectorize our algorithm. This requires building two matrices:
P - the transition probability function - an |A|x|S|x|S| matrix where P[a, s, s'] is the probability that we end up in state s' given that we start in state s and take action a.
R - the reward function - an |A|x|S| matrix where R[a, s] is the reward we receive for taking action a in state s.
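As a rough illustration of why these shapes are convenient, the sketch below builds a tiny P and R with NumPy and performs one vectorized Bellman backup. The sizes, the toy dynamics, the discount factor `gamma`, and the variable names are all assumptions made for this example; they are not taken from the project itself.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_actions, n_states = 2, 3

# P[a, s, s'] = probability of ending in s' after taking action a in state s.
P = np.zeros((n_actions, n_states, n_states))
P[0] = np.eye(n_states)                       # action 0: stay in the same state
P[1] = np.roll(np.eye(n_states), 1, axis=1)   # action 1: move to state s+1 (wrapping around)

# R[a, s] = reward received for taking action a in state s.
R = np.zeros((n_actions, n_states))
R[1, 2] = 1.0                                 # toy reward: action 1 in state 2

# Every (a, s) row of P must be a probability distribution over s'.
assert np.allclose(P.sum(axis=2), 1.0)

# One vectorized Bellman backup (gamma is an assumed discount factor).
gamma = 0.9
V = np.zeros(n_states)
Q = R + gamma * (P @ V)   # P @ V has shape (|A|, |S|)
V_new = Q.max(axis=0)     # best achievable value for each state
```

With P and R laid out this way, the backup over all states and actions is a single matrix product rather than nested loops.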
In addition, because we want these to be 3- and 2-dimensional matrices respectively, we need a conversion between a state and a single index, and likewise between an action and a single index. When a state or action is represented as an array of indices into a multidimensional matrix, we start with the first index; then, for each subsequent dimension, we multiply the running total by the size of that dimension and add the index for that dimension. This is the usual row-major flattening.
Example:
The state array is a 10x20x30x40 matrix and we have indices [5, 6, 7, 8] that we want to convert to a single index. We compute ((5 * 20 + 6) * 30 + 7) * 40 + 8 = 127488.
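The conversion can be sketched in a few lines of Python. The helper names `flatten_index` and `unflatten_index` are hypothetical, introduced here only for illustration; NumPy's `np.ravel_multi_index` and `np.unravel_index` perform the same row-major conversion and are used below as a cross-check.

```python
import numpy as np

def flatten_index(indices, shape):
    """Convert multidimensional indices to a single row-major index."""
    flat = 0
    for idx, size in zip(indices, shape):
        if not 0 <= idx < size:
            raise IndexError(f"index {idx} out of range for dimension of size {size}")
        # Multiply the running total by this dimension's size, then add its index.
        flat = flat * size + idx
    return flat

def unflatten_index(flat, shape):
    """Inverse of flatten_index: recover the multidimensional indices."""
    indices = []
    for size in reversed(shape):
        flat, idx = divmod(flat, size)
        indices.append(idx)
    return indices[::-1]

shape = (10, 20, 30, 40)
single = flatten_index([5, 6, 7, 8], shape)       # 127488, matching the worked example
recovered = unflatten_index(single, shape)        # [5, 6, 7, 8]

# Cross-check against NumPy's built-in row-major conversion.
numpy_single = int(np.ravel_multi_index((5, 6, 7, 8), shape))
```

The same helpers apply to actions: pass the action's index array and the shape of the action space.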