Closed. Jamesflynn1 closed this issue 1 year ago.
We can:
1. store prior strategies (or at least their indices) to the state for the state player,
2. store the action history to the state for the state player,
3. recreate the probabilities leading to the state according to the deviation, and
4. weight the states to be either included or excluded according to the memory weights (i.e. deviation-type dependent).
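A minimal sketch of steps 1–4, assuming a simple linear interpolation between "remember" and "forget" for each step; the function name, arguments, and the interpolation form are all illustrative assumptions, not the actual implementation:

```python
import numpy as np

def weighted_reach_prob(prior_strategies, action_history, memory_weights):
    """Recreate the state player's reach probability for a state, weighting
    each step on the path by a deviation-type-dependent memory weight.

    prior_strategies: per-infostate policies (prob arrays) stored when the
        state was reached (step 1).
    action_history: the player's actions along the path (step 2).
    memory_weights: 0/1 (or fractional) weights (step 4); np.ones recovers
        the plain product-of-probabilities reach computation.
    """
    reach = 1.0
    for policy, action, w in zip(prior_strategies, action_history, memory_weights):
        p = policy[action]  # step 3: prob the prior strategy assigned to this action
        # w = 1 remembers the step fully; w = 0 excludes it (factor of 1)
        reach *= w * p + (1.0 - w)
    return reach

strategies = [np.array([0.5, 0.5]), np.array([0.25, 0.75])]
actions = [0, 1]
print(weighted_reach_prob(strategies, actions, np.ones(2)))  # 0.5 * 0.75 = 0.375
```

With `np.zeros(2)` every step is treated as forgotten and the function returns 1.0.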
Steps 1 to 3 have been tested against a working computation of the state player's reach probability (OpenSpiel's CFR reach_prob computation). With the step-4 weights set to np.ones, the two computations are equivalent for ALL states over 100 iterations (though two iterations would be sufficient to check).
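The np.ones sanity check described above can be sketched as follows; the path probabilities are made-up values and the weighting form is an assumption, but with unit weights it must reduce to the plain product, which is the equivalence being tested:

```python
import numpy as np

# Illustrative per-step probabilities the state player assigned to the
# actions actually taken on the path to some state (not real game data).
path_probs = np.array([0.5, 0.25, 0.8])

# CFR-style reach probability: plain product along the path.
cfr_reach = np.prod(path_probs)

# Memory-weighted recomputation with the step-4 weights set to np.ones;
# each factor becomes 1 * p + 0 = p, so the result should match exactly.
weights = np.ones_like(path_probs)
weighted_reach = np.prod(weights * path_probs + (1.0 - weights))

assert np.isclose(cfr_reach, weighted_reach)
```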
BPS appears to have CFR-like performance, and is maybe slightly less exploitable; I will check with 10000 iterations on the batch compute.
This appears to have fixed the issue.
The bug is to do with either the histories or the deviation variables. It should be an easy fix; I need to test on blind action deviations first.