BartMassey / bingomatic

Monte-Carlo Bingo simulator in C
Other
1 stars 0 forks source link

Replace bitboards with counters #5

Open BartMassey opened 5 years ago

BartMassey commented 5 years ago

As stated in issue #2, right now the performance bottleneck is the RNG. Once that is fixed, the bitboards should be replaced with counters to improve performance.

An array of counters per card can be used along with a per-marker array of counter indices to bump. This should reduce the per-card win test cost quite a bit.

BartMassey commented 5 years ago

OK, here's the details of the insane plan I came up with.

Total size of per-board state: 49 bytes. The bitboard representation we're using now uses 16 * 12 = 192 bytes, and we have 12 bytes of counters, so we're currently at 204 bytes. The memory savings from the new scheme is thus more than 4x.

There's a one weird trick here: old-schoolers hate it. How do we get from a hit in the bitboard to an index into the action table? We count the number of 1 bits to the left of the position in the bitboard. That sounds dumb on the face of it. However, modern processors have a 1-cycle 64-bit POPCNT instruction. We can do some shifts and a couple of popcnts and be done with it.

This scheme has another advantage. Most of the time a marker will miss this card entirely. In that case all we'll touch is the card's bitboard: the rest of it will be given a miss. This should save some CPU time and memory accesses.

The downside is that this is a lot of fiddly code to write. I think it's worth it. It's a nice demo of some things for my Systems Programming class. It also shows why everybody kept whining about "why don't CPUs do popcount?"

BartMassey commented 5 years ago

I think my scheme above is overcomplicated. Let's try scheme 2:

Total: 40 bytes. If you order the fields E-D-B-C-A no padding is needed to store these in an array. There's also no need for any kind of global table: you could save 3 bytes total on the increment instructions by adding one, but it would be hairy to implement and you'd just have to give it back in padding anyhow.

There are games you could play with the free square (if implemented, see issue #1) to get the counters into a single bit word: the diagonal counters, as well as the middle row and column, can be 2-bit counters since they'll never get above 3 before the game is over. This gives us 8×3 + 4×2 = 32 bits. But given the hassle of the packing, and given the padding argument above, and given that the free square is currently optional, and given some clever tricks one can do with the uniform counter size with some "headroom", this seems like a terrible idea.

BartMassey commented 5 years ago

Update: I've implemented this and it seems to work well. About a 6x speedup over bitboards. I need to finish validating it to make sure I don't have more bugs before I close this, though.