AlignmentResearch / KataGoVisualizer

MIT License

More precise compute estimates #41

Open ed1d1a8d opened 1 year ago

ed1d1a8d commented 1 year ago

From lightvector:

A. Recall from https://arxiv.org/abs/1902.10565 that 75% of the positions in a game are never saved as training data. On those, the cheap visit limit is used. So you're going to be substantially underestimating the cost by not counting compute on positions that aren't saved, even though the visit limit on those is low.
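Point A can be made concrete with a small sketch. The 1500/250 visit limits and the 25% save fraction come from the discussion here; everything else is illustrative.

```python
def expected_visits_per_move(full_visits, cheap_visits, frac_saved=0.25):
    """Expected search visits per move when only frac_saved of positions
    get a full search (and are written to the training data) while the
    rest use the cheap visit limit."""
    return frac_saved * full_visits + (1 - frac_saved) * cheap_visits

# Counting only the saved positions (the underestimate):
saved_only = 0.25 * 1500                         # 375.0 visits/move
# Also counting cheap searches on the unsaved 75%:
corrected = expected_visits_per_move(1500, 250)  # 375 + 187.5 = 562.5
```

Under these numbers, ignoring the unsaved positions misses about a third of the search visits, which is why the underestimate is substantial.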

B. There's a thing where stochastically some positions are saved out more than once and some not at all (policy and value surprise weighting), but the computation for doing that should preserve d in expectation, so I think this is neutral in expectation for your purpose, causing you neither to over- nor underestimate.

C. KataGo used 600 full / 100 cheap visits for roughly the first 1-2 days of training (roughly up through b10c128, and maybe between 1/4 and 1/2 of b15c192); 1000 full / 200 cheap for the rest of g170 (i.e. all the kata1 models that were imported from the former run g170, which was done on private hardware alone before that run became the prefix for the current distributed run kata1); and then 1500 full / 250 cheap for all of distributed training so far. So you'll need to use the appropriate visit cutoffs for each model range.
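The schedule above could be encoded as a small lookup table. The visit limits are the ones stated here; the era labels and boundary descriptions are assumptions for illustration, and the exact model-name cutoffs would need to be confirmed against the run history.

```python
# (era description, full visits, cheap visits) -- limits per lightvector,
# era boundaries approximate.
VISIT_LIMITS = [
    ("early g170 (through ~b10c128, part of b15c192)", 600, 100),
    ("rest of g170",                                   1000, 200),
    ("distributed training (kata1)",                   1500, 250),
]

def visit_limits_for_era(era_index):
    """Return (full, cheap) visit limits for a training era index."""
    _, full, cheap = VISIT_LIMITS[era_index]
    return full, cheap
```

A real estimate would map each model to its era before applying the per-move cost formula.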

D. The cheap search limit is also used even for rows that are saved, once the winrate is sufficiently extreme, to save a bit of compute when playing out long endgames. However, the probability of writing the row decreases too, so I think this somewhat cancels out, though not entirely.

E. There's a neural net cache that reuses old query results: it's hit when, on a future turn, you visit a node you already searched on a previous turn, or when multiple move sequences within a search lead to the same position. I think this typically saves somewhere between 20% and 50% of the cost of a search relative to a naive estimate based on the number of visits, so you're overestimating here.
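The cache correction can be applied as a simple scaling factor. The 20-50% range is from the comment above; the choice of a 35% midpoint default is an assumption, and a careful estimate might propagate the whole range as error bars.

```python
def cache_adjusted_cost(naive_visit_cost, cache_savings=0.35):
    """Scale a naive visits-based cost estimate down by an assumed
    neural-net-cache hit rate (20-50% per lightvector; 0.35 is an
    assumed midpoint, not a measured value)."""
    if not 0.0 <= cache_savings < 1.0:
        raise ValueError("cache_savings must be in [0, 1)")
    return naive_visit_cost * (1 - cache_savings)

# Bounding the estimate instead of picking one number:
low  = cache_adjusted_cost(1000.0, cache_savings=0.50)  # most cache reuse
high = cache_adjusted_cost(1000.0, cache_savings=0.20)  # least cache reuse
```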

F. A lot of KataGo's games are on board sizes smaller than 19x19. One could save some cost by using smaller tensors in those cases; in practice that optimization wasn't implemented because of batching. So this is more just a note about where theoretical flops diverge significantly from practical flops due to practical engineering considerations. In the future I might implement this optimization in a way that still works dynamically with batching.
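To quantify the theoretical-vs-practical gap in point F: the convolutional part of a forward pass scales roughly with board area, but in practice every position is padded to a 19x19 tensor. A rough sketch (the area-scaling approximation is an assumption; it ignores any fixed-size head layers):

```python
def relative_theoretical_flops(board_size, reference=19):
    """Approximate forward-pass cost of a board_size x board_size
    position relative to a reference x reference one, assuming conv
    cost scales with board area."""
    return (board_size ** 2) / (reference ** 2)

# A 9x9 game theoretically costs ~22% of a 19x19 game per position,
# but with fixed 19x19 tensors its practical cost ratio stays at 1.0.
ratio_9x9 = relative_theoretical_flops(9)  # 81 / 361
```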

G. I think your notebook is greatly underestimating cost because it's missing a ton of the networks from g170. They were only sparsely copied over onto katagotraining.org, since there was no value in having all of them there. You'll need to get the full list from https://katagoarchive.org/

H. You're both overcounting and undercounting a little at different points in your notebook because you don't account for the fact that sometimes more than one model is being trained jointly: one model generates the data, but both models train on it. So naively sorting the models by d and multiplying size by the change in d won't work; you need to filter out the models that are not generating the data at a given time.
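The fix for point H amounts to attributing self-play compute only to the model that was actually generating data at each point in time. A sketch of that accounting, where all field names and numbers are hypothetical:

```python
def selfplay_flops(models):
    """Total self-play compute, crediting each data row only to the
    model that generated it. 'flops_per_row' is that model's cost for
    one saved row; 'rows_generated' is zero for models that were
    jointly trained on the data but never produced it."""
    return sum(m["flops_per_row"] * m["rows_generated"] for m in models)

models = [
    {"name": "net_a", "flops_per_row": 5.0, "rows_generated": 100.0},
    # Jointly trained passenger: trains on net_a's data, generates none.
    {"name": "net_b", "flops_per_row": 9.0, "rows_generated": 0.0},
]
total = selfplay_flops(models)  # only net_a's generation counts
```

Naively multiplying each model's size by its change in d would wrongly charge net_b for rows it never produced.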