EricSteinberger / Deep-CFR

Scalable Implementation of Deep CFR and Single Deep CFR
MIT License

Why mean over all actions sampled in multi outcome sampling #7

Open annw0922 opened 4 years ago

annw0922 commented 4 years ago

https://github.com/EricSteinberger/Deep-CFR/blob/master/DeepCFR/workers/la/sampling_algorithms/MultiOutcomeSampler.py

Since `aprx_imm_reg` here is computed for every action and pushed to the buffer without being summed up, I have no idea why `aprx_imm_reg *= legal_action_mask / n_actions_to_smpl`.

I think it is because I could not understand the formula here (ṽ(I) = p(a) · |A(I)|), and I failed to find the corresponding part in your paper:

> Last state values are the average, not the sum of all samples of that state since we add ṽ(I) = p(a) · |A(I)|. Since we sample multiple actions on each traverser node, we have to average over their returns like: ṽ(I) = Σ_{a=0}^{N} ṽ(I|a) · p(a) · |A(I)| / N.
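To make the quoted comment concrete, here is a minimal sketch of the estimator it describes. This is not code from the repo; `sampled_state_value`, the uniform sampling, and all variable names are my own assumptions made for illustration:

```python
import numpy as np

def sampled_state_value(action_values, policy, n_actions_to_smpl, rng):
    """Monte-Carlo estimate of v(I) from a uniformly sampled subset of actions.

    Each sampled action a contributes v(I|a) * p(a) * |A(I)|, which is an
    unbiased single-sample estimate of v(I) = sum_a v(I|a) * p(a) when actions
    are drawn uniformly. Averaging over the N samples (dividing by
    n_actions_to_smpl) keeps the expectation at v(I); summing instead would
    scale the estimate up by N.
    """
    n_actions = len(action_values)
    sampled = rng.choice(n_actions, size=n_actions_to_smpl, replace=False)
    est = 0.0
    for a in sampled:
        est += action_values[a] * policy[a] * n_actions  # v(I|a) * p(a) * |A(I)|
    return est / n_actions_to_smpl  # average, not sum
```

With all actions sampled (N = |A(I)|) this recovers exactly Σ_a ṽ(I|a) · p(a); with fewer samples it is an unbiased but noisier estimate of the same quantity.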

Is there any reference for it?

Thanks a lot!

EricSteinberger commented 4 years ago

Hi! This is to make sure that the estimate is not scaled up just because you sample more actions. The regrets get more accurate the more actions you sample, but the expectation of the value should stay the same and not grow linearly. Does this make sense? It's not in the paper, you are right. Thank you for checking before opening the issue, appreciated! This is an implementation detail, and the paper itself doesn't use MOS sampling; it uses external sampling, where this division doesn't really matter.
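A hypothetical toy demonstration of that point (the `estimate` function and the numbers are my own, not from the repo): summing the per-sample value estimates scales the state value linearly with the number of sampled actions, while dividing by the sample count, mirroring the division by `n_actions_to_smpl`, keeps the expectation fixed and only reduces variance.

```python
import numpy as np

def estimate(action_values, policy, n_smpl, rng, average):
    """Value estimate from n_smpl uniformly sampled actions (with replacement).

    Each sample contributes v(I|a) * p(a) * |A(I)|. With average=True we
    divide by n_smpl; with average=False we just sum the samples.
    """
    n = len(action_values)
    idx = rng.integers(n, size=n_smpl)
    total = sum(action_values[a] * policy[a] * n for a in idx)
    return total / n_smpl if average else total

vals = np.array([0.0, 1.0, 2.0])
pol = np.array([0.5, 0.3, 0.2])
exact = float(vals @ pol)  # true v(I) = 0.7

rng = np.random.default_rng(1)
# Averaging: expectation stays near 0.7 regardless of how many actions
# we sample; extra samples only shrink the variance.
avg4 = np.mean([estimate(vals, pol, 4, rng, average=True) for _ in range(50000)])
# Summing instead: the estimate grows linearly with the sample count (~4 * 0.7).
sum4 = np.mean([estimate(vals, pol, 4, rng, average=False) for _ in range(50000)])
```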