Further info: the position is 3q1rk1/p4p1p/4p3/3pN3/3Pn3/3Q4/Pr3PPP/R4RK1 w - - 1 20. Engine: lc0-v0.21.1-windows-cuda, net: 256x20-2019_0601_0414_20_514.pb. The command used was "go infinite" and then "stop" to stop the search.
How did you calculate Q to be 0.063?
Well, roughly speaking I calculated (sum over replies r: r_Q * r_nodes) / rootMove_nodes. (Strictly, I calculated 0.063 as an upper bound, by assuming every reply with fewer than 10K nodes to have a Q of -1.)
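For concreteness, a minimal sketch of that check (not lc0 code; the numbers and field names are made up, and perspective/sign conventions are ignored): the root move's Q should roughly equal the visit-weighted average of its replies' Q values, with low-visit replies pessimistically counted as -1.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical per-reply data as shown by the search output: visit count and Q.
struct Reply {
  uint64_t nodes;
  double q;
};

// Visit-weighted average of the replies' Q values; replies with fewer than
// min_nodes visits are counted as Q = -1 (the upper-bound estimate above).
double EstimateParentQ(const std::vector<Reply>& replies, uint64_t min_nodes) {
  double weighted_sum = 0.0;
  uint64_t total_nodes = 0;
  for (const Reply& r : replies) {
    weighted_sum += (r.nodes < min_nodes ? -1.0 : r.q) * r.nodes;
    total_nodes += r.nodes;
  }
  return total_nodes == 0 ? 0.0 : weighted_sum / total_nodes;
}

int main() {
  // Made-up numbers, just to show the calculation.
  std::vector<Reply> replies = {{150000000, 0.06}, {20000000, 0.08}, {5000, 0.9}};
  std::cout << EstimateParentQ(replies, 10000) << "\n";
}
```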
@MelleKoning Did I mess up when doing the calculations?
Did you account for the parent's initial q fetched from the NN? Every node's q is not only the children's q weighted by their visits, but also includes its own q (fetched from the NN when the node is expanded).
@Videodr0me Just how much weight does that own q have? Even if it was 1000 times that of a leaf node it still wouldn't make the slightest difference.
On a side note, the reason for this bug is that Q is a float; making it a double fixes the problem. In my short testing I don't see obvious problems with it being a double: roughly the same memory usage and roughly the same speed. It also behaves as expected, though in long searches it for some reason finds some of the difficult tactics I tried later than the float version. More thorough testing is probably needed to see whether the change would help at, e.g., TCEC time controls.
> @Videodr0me Just how much weight does that own q have? Even if it was 1000 times that of a leaf node it still wouldn't make the slightest difference.
Potentially the initial Q fetched from the NN can diverge substantially from later Qs propagated through the tree, so it might have a larger impact if it's initially way off, even with many samples. But as you found the limited precision to be the source of your discrepancy, that point is moot. I think we did it that way to save space, and because the accuracy of that average does not matter much as long as it is not biased in any direction.
That is the problem though. The precision loss from using e.g. fp16 indeed does not matter much for that reason: it is not biased. However, the precision loss from the incremental updating of Q is biased. In the very long run Q stops updating at all due to underflow, so only the first however many (a couple of million, I guess) nodes influence Q; after that it just stays the same (or at least accumulates serious rounding errors). A related problem is that we are essentially adding and subtracting increments that are tiny compared to Q itself, which is inherently unstable and leads to problems without a lot of precision (as can be seen in the example above, and in the fact that the double version doesn't have that problem).
@koedem there is indeed a difference between using float and double when calculating the new Q.
Example calculation: https://onlinegdb.com/HkyR6RwAE
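The linked snippet isn't reproduced here, but a minimal sketch of the kind of comparison presumably involved: apply the same incremental update q += (v - q) / n in float and in double, and watch the float version stop moving once n is large.

```cpp
#include <cstdint>
#include <iomanip>
#include <iostream>

int main() {
  std::cout << std::setprecision(12);
  // Same incremental Q update, done in float and in double.
  float q_f = 0.5f;
  double q_d = 0.5;
  const double v = 1.0;  // a strongly winning evaluation coming in
  for (uint64_t n = 100000000; n < 100000010; ++n) {
    q_f += (float(v) - q_f) / n;
    q_d += (v - q_d) / n;
  }
  // The float Q is stuck (the per-visit increment is below its precision),
  // while the double Q still moves.
  std::cout << "float:  " << q_f << "\n";
  std::cout << "double: " << q_d << "\n";
}
```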
The issue of Q not changing once the update falls below the accuracy threshold has been discussed here: https://github.com/LeelaChessZero/lc0/issues/186 There are also a number of proposed solutions mentioned there which do not require higher precision. One thing to keep in mind is that the number of visits required will only be reached near the root (or even only at the root, where it does not even matter), and that N still keeps updating. It's basically a memory vs. accuracy tradeoff, and we opted for memory at that point in time. I think much of the discussion referenced in that link is still valid.
In case someone cares, an example search: https://pastebin.com/4vzWzhQW (the last move in the list is the highest ranked one). The eval actually keeps changing until 178M nodes or so and only then gets stuck. I imagine this is because the eval is close to 0, so a node eval close to 1 or -1 can still affect it. This has the further side effect that late in such a search only extreme evals influence the root eval, introducing further bias, including a pull towards 0: for, e.g., a positive eval, even an incoming +1 no longer changes the root, but a -1 still does.
And yes, of course it costs 2*4 extra bytes (Q and U. EDIT: U is calculated from scratch, right? Then it wouldn't need to be a double; I think float should be enough accuracy there.) But with the edges and node together being something like 300 bytes, that doesn't sound like the biggest of deals to me. And it could of course be an option (either run time or compile time).
I was also wondering about only a few nodes needing it; a node could be transformed into a DoubleNode once it hits some number of visits, although that might be slightly trickier to implement.
Just saw another scenario where this very clearly made the eval wrong. A position was completely winning for one side; the float eval first climbed from Q = 0.85 to about 0.9, but then started to drop towards 0.8 even though the double eval kept rising. The likely reason: incoming evals of +1 no longer change Q, because the difference is only about 0.1, and divided by the node count that is too small to register; but the few positions evaluated at -1 have a Q difference of almost 2, which still registers, so over time they falsely drag the eval down.
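A quick way to see this asymmetry in isolation (a standalone sketch, numbers chosen purely for illustration, not taken from lc0): with a float Q around 0.9 and a visit count in the tens of millions, the +1 update rounds away entirely while the -1 update still gets through.

```cpp
#include <cstdint>
#include <iomanip>
#include <iostream>

int main() {
  std::cout << std::setprecision(10);
  const uint64_t n = 20000000;  // 20M visits
  float q = 0.9f;

  // Update with a winning eval: increment is 0.1 / 20M = 5e-9, far below the
  // precision of a float near 0.9, so q does not move at all.
  float q_up = q + (1.0f - q) / n;

  // Update with a losing eval: increment is -1.9 / 20M, about -9.5e-8, which
  // is large enough to change q.
  float q_down = q + (-1.0f - q) / n;

  std::cout << "q:               " << q << "\n";
  std::cout << "after +1 update: " << q_up << "  (unchanged)\n";
  std::cout << "after -1 update: " << q_down << "  (moved)\n";
}
```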
Issue still relevant; it will probably only be fixed by a double implementation of the relevant variables. @MelleKoning seems to have worked on it in the past, and some ideas about deciding at compile time which variables should be float/double are floating around.
Would it be better to fix this by re-averaging nodes every 30k visits or so? That way we don't use extra memory, but get the accuracy benefits.
With the Q-double version I used something like 2% extra memory, which tbh is negligible. I don't think re-averaging fully fixes the issue; there will still be some growing rounding error (though I guess if you re-average, at least the error might not be as biased anymore, which is a plus).
Note that originally we stored W (the sum of all V in the subtree), with Q = W/N, and that's actually how it is described in the a0 paper. After some time we optimized W out as it didn't seem necessary. Maybe it makes sense to return W.
If I remember correctly, the reason we don't use it is that it adds an extra division every time we want to compute Q in Q+U. It would almost certainly be quite a noticeable slowdown.
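For context, here's a rough sketch of where that division would sit (generic PUCT-style selection, not lc0's actual code; Edge, cpuct and the field names are made up for illustration):

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>

// Simplified per-edge data, not lc0's actual layout.
struct Edge {
  float q;     // stored average value (current scheme)
  float w;     // sum of backed-up values (original scheme, Q = W / N)
  uint32_t n;  // visit count
  float p;     // policy prior
};

// PUCT-style score with a stored Q: no per-edge division needed for Q.
float ScoreStoredQ(const Edge& e, float cpuct, float sqrt_parent_n) {
  float u = cpuct * e.p * sqrt_parent_n / (1 + e.n);
  return e.q + u;
}

// Same score when only W is stored: one extra division (W / N) per edge,
// done for every candidate child on every selection step.
float ScoreStoredW(const Edge& e, float cpuct, float sqrt_parent_n) {
  float q = e.n == 0 ? 0.0f : e.w / e.n;
  float u = cpuct * e.p * sqrt_parent_n / (1 + e.n);
  return q + u;
}

int main() {
  Edge e{0.12f, 1200.0f, 10000, 0.05f};
  float sqrt_parent_n = std::sqrt(50000.0f);
  std::cout << ScoreStoredQ(e, 1.5f, sqrt_parent_n) << " "
            << ScoreStoredW(e, 1.5f, sqrt_parent_n) << "\n";
}
```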
We (and a0) stored both Q and W. Q was updated only on backprop, when W changed.
Doesn't that just use as much memory as storing a double for Q?
It does, but potentially it accumulates less error. It's hard to say whether that's actually the case, though. And yeah, we'd save on the double-to-float conversion, which is surprisingly noticeable.
This is actually relatively easy to estimate. If we use a float for W, we have 24 bits of mantissa, so for a Q of 0.5, for example, W is around 5 million once we reach 10 million visits, and adding a single V to W barely changes it anymore. From that point on we would expect Q to decay towards 0 at a rate of roughly 1/n. This is why the best approach is probably to just use a float for Q and recalculate the average from scratch every now and then. These recalculations don't need to happen often, and therefore remove any accumulated inaccuracy without memory or performance impact.
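A minimal sketch of that periodic recalculation (hypothetical structures, not lc0 code; the 30k interval is taken from the suggestion above, and sign conventions between plies are ignored):

```cpp
#include <cstdint>
#include <vector>

// Illustrative structures only; lc0's real Node/Edge layout differs.
struct Child {
  float q;     // child's average value
  uint32_t n;  // child's visit count
};

struct Node {
  float q = 0.0f;     // incrementally updated float average
  float v_nn = 0.0f;  // the node's own value from the NN
  uint32_t n = 0;     // visit count
  std::vector<Child> children;
};

constexpr uint32_t kReaverageInterval = 30000;  // "every 30k visits or so"

// Keep the cheap incremental float update, but periodically recompute Q
// exactly from the children, discarding the accumulated rounding error.
void Update(Node& node, float v) {
  ++node.n;
  node.q += (v - node.q) / node.n;
  if (node.n % kReaverageInterval == 0) {
    // Own NN value counted once, children weighted by their visit counts.
    double sum = node.v_nn;
    uint64_t total = 1;
    for (const Child& c : node.children) {
      sum += double(c.q) * c.n;
      total += c.n;
    }
    node.q = float(sum / total);
  }
}

int main() {
  // Toy usage: the numbers are arbitrary and only show the call pattern.
  Node node;
  node.v_nn = 0.3f;
  node.children = {{0.31f, 40000}, {0.28f, 20000}};
  for (int i = 0; i < 60000; ++i) Update(node, 0.3f);
}
```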
Hmm, we could use a double for W and a float for Q. As we only convert on node update, it won't slow down visits. Since the P values are still floats, memory usage would only grow by 1% or so.
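A minimal sketch of that scheme (illustrative names, not actual lc0 code): accumulate W in double during backprop and derive the float Q from it, so selection keeps reading a float while the accumulation itself never underflows.

```cpp
#include <cstdint>
#include <iostream>

// Illustrative node fields only, not lc0's real Node layout.
struct Node {
  double w = 0.0;  // high-precision sum of all backpropagated values
  float q = 0.0f;  // cached average, read by the selection loop
  uint32_t n = 0;  // visit count
};

// Called once per backed-up value v: the double accumulation keeps full
// precision, and the float Q is just a snapshot of W / N.
void Backpropagate(Node& node, double v) {
  node.w += v;
  ++node.n;
  node.q = static_cast<float>(node.w / node.n);
}

int main() {
  // Toy usage: many backprops of a constant value; W holds the exact sum in
  // double and Q ends up at ~0.9.
  Node node;
  for (uint64_t i = 0; i < 50000000; ++i) Backpropagate(node, 0.9);
  std::cout << node.q << "\n";
}
```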
I've plotted Q updates in a simulated scenario: V gradually grows from -0.5 to +0.5, with Gaussian noise (sigma = 0.2).
The result is pretty bad for float: it stops updating already after 20'000'000 steps. (The x axis label on the plot is wrong, it should read millions of steps, not thousands.)
When I tried it without noise, float Q didn't update at all. It just stayed at -0.5 the whole time, as the update step was too small.
Float W is not great either, especially when V is close to 0.
Even in a run of 100 million nodes, it stops updating entirely around 20 million nodes. (Again, the x axis label on the plot is wrong, it should read millions of steps, not thousands.)
```cpp
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <random>

constexpr uint64_t kSteps = 50000000;
constexpr int kReport = 10000;

int main() {
  std::random_device dev;
  std::mt19937 rng(dev());
  std::normal_distribution<float> dist(0, 0.2);
  std::cout << std::setprecision(20);

  // Incrementally updated averages (Q) and running sums (W), in both precisions.
  float q_float = 0.0f;
  double q_double = 0.0;
  float w_float = 0.0f;
  double w_double = 0.0;
  uint64_t n = 0;

  for (uint64_t i = 0; i <= kSteps; ++i) {
    // V drifts linearly from -0.5 to +0.5 with Gaussian noise.
    float v = i == 0 ? -0.5 : float(i) / kSteps - 0.5 + dist(rng);
    ++n;
    q_float += (v - q_float) / n;
    q_double += (v - q_double) / n;
    w_float += v;
    w_double += v;
    if (i % kReport == 0) {
      // Columns: step, true running mean, Q(float), Q(double), W(float)/N,
      // W(double)/N, last V.
      std::cout << i << "," << (float(i) / (kSteps * 2) - 0.5) << "," << q_float
                << "," << q_double << "," << w_float / n << "," << w_double / n
                << "," << v << "\n";
    }
  }
}
```
Just changing float to double for wl_ and d_ in Node reduces nps in the random backend from 199236 to 196008 (-1.6%) for me.
Probably worthwhile.
@mooskagh there's one significant problem with your test: float Q only looks good because the true average goes towards 0. Check what happens if instead the values go from 0 to 0.7, for example.
Float Q actually looked bad; with values moving away from 0 the problem just starts later. What actually matters is the "current" value of Q: when it's close to 0, it can survive a bit longer.
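For anyone rerunning the simulation with that 0-to-0.7 range, presumably only the v line needs to change, e.g.:

```cpp
// Replaces the v line in the simulation above: V drifts from 0.0 to 0.7.
float v = i == 0 ? 0.0f : 0.7f * float(i) / kSteps + dist(rng);
```

(The "true running mean" column in the output would then be roughly 0.7f * float(i) / (kSteps * 2).)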
So we will actually (probably) have double Q in the next Leela version? That means I can finally use master again instead of my crappy year-old fork. :D Thanks a lot.
I ran a long-ish search and got this output which clearly can't be right: https://pastebin.com/VY8H0NGh
If you calculate the Q that this d3a3 move should have, then unless I made an error calculating, it should be around 0.063 rather than the 0.075 reported. I have no idea why this happens, but that is quite a difference.