Replace bitboards with counters

BartMassey commented 5 years ago

As stated in issue #2, right now the performance bottleneck is the RNG. Once that is fixed, the bitboards should be replaced with counters to improve performance.

An array of counters per card can be used along with a per-marker array of counter indices to bump. This should reduce the per-card win test cost quite a bit.

BartMassey commented 5 years ago

OK, here's the details of the insane plan I came up with.

Each board has 12 4-bit counters showing the number of pieces in that row, column or diagonal. [8 bytes for a 64-bit representation]
Build a 5x5 "instruction table" common to all boards, which says for a given marker position on the board which combination of the 12 counters to bump.
Each board has a 5x5 "action table" of byte indices into the instruction table, indicating which instruction to use for that position on this board. (These could be packed 5-bit entries, saving 9 bytes, but ugh.) [25 bytes]
Each board has a 72-bit bitboard indicating which squares are active on this board. [16 bytes]

Total size of per-board state: 49 bytes. The bitboard representation we're using now uses 16 * 12 = 192 bytes, and we have 12 bytes of counters, so we're currently at 204 bytes. The memory savings from the new scheme is thus more than 4x.

There's a one weird trick here: old-schoolers hate it. How do we get from a hit in the bitboard to an index into the action table? We count the number of 1 bits to the left of the position in the bitboard. That sounds dumb on the face of it. However, modern processors have a 1-cycle 64-bit POPCNT instruction. We can do some shifts and a couple of popcnts and be done with it.

This scheme has another advantage. Most of the time a marker will miss this card entirely. In that case all we'll touch is the card's bitboard: the rest of it will be given a miss. This should save some CPU time and memory accesses.

The downside is that this is a lot of fiddly code to write. I think it's worth it. It's a nice demo of some things for my Systems Programming class. It also shows why everybody kept whining about "why don't CPUs do popcount?"

BartMassey commented 5 years ago

I think my scheme above is overcomplicated. Let's try scheme 2:

75-bit bitboard for squares on the board, with popcount trick, as before. Split it into a 64-bit field (A) and a 16-bit field (B) (8 + 2 bytes)
10 3-bit row and column counters (C) (4 bytes)
2 3-bit diagonal counters (D) (1 byte)
1 byte of counter increment instructions for each square on the 5x5 board (E). A counter increment instruction is formatted as 1 bit for negative diagonal, 1 bit for positive diagonal, 3 bits for row, 3 bits for column. If a diagonal bit is 0, this square is not part of that diagonal. Every square is part of a row and a column. (25 bytes)

Total: 40 bytes. If you order the fields E-D-B-C-A no padding is needed to store these in an array. There's also no need for any kind of global table: you could save 3 bytes total on the increment instructions by adding one, but it would be hairy to implement and you'd just have to give it back in padding anyhow.

There are games you could play with the free square (if implemented, see issue #1) to get the counters into a single bit word: the diagonal counters, as well as the middle row and column, can be 2-bit counters since they'll never get above 3 before the game is over. This gives us 8×3 + 4×2 = 32 bits. But given the hassle of the packing, and given the padding argument above, and given that the free square is currently optional, and given some clever tricks one can do with the uniform counter size with some "headroom", this seems like a terrible idea.

BartMassey commented 5 years ago

Update: I've implemented this and it seems to work well. About a 6x speedup over bitboards. I need to finish validating it to make sure I don't have more bugs before I close this, though.

BartMassey / bingomatic

Replace bitboards with counters #5