Closed: hans-ekbrand closed this 3 years ago
Whether we down sample by a factor of 32 before or after shouldn't matter - we do multiple passes of the data and get multiple chances to apply value filtering to the results. The purpose of the 1/32 sampling is to get better shuffling of the data. However, the current implementation doesn't simply down sample by a factor of 32 afterwards - it's some curious if statement attempting to bring the total selection ratio to 1/32. That may reduce the shuffling quality compared to before, although probably not meaningfully (almost certainly not meaningfully compared to pre-value-focus). If anything, the only upside I would expect from this PR is maybe higher training speed when the data pipeline is the bottleneck.
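For intuition, here is a minimal sketch of the two orderings being discussed - not the actual `train.py`/`chunkparser.py` logic; the function names and the linear shape of `value_focus_prob` are assumptions:

```python
import random

SKIP = 32  # probabilistic 1-in-32 downsampling, as in train.py

def value_focus_prob(orig_q, best_q, value_focus_min=0.1, slope=1.0):
    # Hypothetical keep probability: positions where search moved the eval
    # a lot are kept more often; value_focus_min is the floor.
    return min(1.0, value_focus_min + slope * abs(best_q - orig_q))

def downsample_then_focus(records):
    # Pre-PR ordering: value_focus only ever sees ~1/SKIP of the positions
    # offered in a given pass over the data.
    for r in records:
        if random.random() < 1.0 / SKIP:
            if random.random() < value_focus_prob(r["orig_q"], r["best_q"]):
                yield r

def focus_then_downsample(records, mean_focus_prob):
    # This PR's ordering: value_focus sees every position, and a second stage
    # downsamples the survivors so the *total* rate stays near 1/SKIP.
    # mean_focus_prob is the average value_focus keep probability of the data.
    second_stage = min(1.0, 1.0 / (SKIP * mean_focus_prob))
    for r in records:
        if random.random() < value_focus_prob(r["orig_q"], r["best_q"]):
            if random.random() < second_stage:
                yield r
```

In both orderings a given record is accepted with probability proportional to its `value_focus_prob`, which is why, over enough passes, the q-diff distribution of the sampled records comes out the same.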
> Whether we down sample by a factor of 32 before or after shouldn't matter - we do multiple passes of the data and get multiple chances to apply value filtering to the results.
Point taken - I didn't know we did multiple passes of the data. Does each pass start from the same position?
Still, if the number of passes is (well) below 32, some positions will never be shown to value focus, and in that case this PR will have its intended effect.
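For a sense of scale: with 1-in-32 sampling per pass, a position is missed in all of n passes with probability (31/32)^n, so after 8 passes roughly 78% of positions have still never been offered to value focus, and even 32 passes leave about 1/e ≈ 37% unseen.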
I have tested the code and it works. As tilps pointed out, one effect of this PR is that it speeds up the sampling of positions. While that is in itself a good thing, I wouldn't have bothered going further with this PR if that were the only effect. In my testing I discovered a bug in the current code that makes an assertion in insert_or_replace() in shufflebuffer.py fail (see https://discord.com/channels/425419482568196106/425419999096733706/866464023212720149 for an elaborated description of that problem).
The PR turned out to be required for low values of `value_focus_min`. Without the PR, `train.py` bails out due to the failed assertion.
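For context, a minimal sketch of the kind of check that trips - modeled on `insert_or_replace()` in `shufflebuffer.py`, but the body here is illustrative, not the real implementation:

```python
class ShuffleBuffer:
    def __init__(self, elem_size, elem_count):
        self.elem_size = elem_size  # every record must be exactly this many bytes
        self.buffer = bytearray(elem_size * elem_count)

    def insert_or_replace(self, item):
        # The assertion that fails: an upstream sampling bug can hand the
        # buffer a record of the wrong length (e.g. empty or truncated).
        assert len(item) == self.elem_size, len(item)
        # ... insert item, or swap it with a random buffered record ...
```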
I tested the PR only with SL, and there I got good results with `value_focus_min: 0.1`, trained for 7,200 steps on a set of 875,912 chunks of T74 training data. In RL, I suspect higher values are better, but I don't know how high one can go and still have an effect. Since T74 is about to be retired, I suggest we finish it off with a value_focus experiment.
Closing this since it does not affect the distribution of q diffs in the sampled records. The bug revealed here is still a problem, but it is most likely merely hidden by this PR rather than fixed.
`SKIP=32` in `train.py` defines how many sequential positions to discard, to avoid sampling over-correlated positions, since they typically come from sequential positions in a game. (Technically, `SKIP` is used to create a probability of inclusion, so there is no hard limit against sequential positions; they are just unlikely to appear.) After discarding 31 out of 32 positions, `value_focus` discards further positions, especially positions where `orig_q` is very similar to the eval after search (`best_q`). This patch does two things:

1. `value_focus` sampling gets to see all positions (instead of only one position in 32).
2. Downsampling still targets `SKIP=32`, so that the total sampling rate is 1 over `SKIP` when the downsampling performed by `value_focus` is accounted for.

Preliminary testing of the idea rather than the implementation - by hardcoding `SKIP=1` and setting `value_focus_min` to 1/32 - suggests this is an improvement (see the sketch below).
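To make the preliminary test concrete, here is a hedged sketch of the effective keep probability with `SKIP` hardcoded to 1 - the linear shape and the slope parameter are assumptions, not the exact chunkparser formula:

```python
def keep_prob(q_diff, value_focus_min=1 / 32, slope=1.0):
    # With SKIP = 1 every position reaches value_focus. A floor of 1/32
    # reproduces the old effective rate for "uninteresting" positions
    # (orig_q ~= best_q), while positions where search changed the eval
    # are kept more often.
    return min(1.0, value_focus_min + slope * abs(q_diff))

print(keep_prob(0.0))  # 0.03125 -> matches the old 1-in-32 sampling rate
print(keep_prob(0.5))  # 0.53125 -> "interesting" positions sampled far more often
```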