charleskawczynski / PokerHandEvaluator.jl

A Poker hand evaluator, with Texas hold 'em in mind.
MIT License

Add SubArray interface for better memory management #28

charleskawczynski closed this 3 years ago

charleskawczynski commented 3 years ago

This PR tries to do better with allocations.

Closes #27. I'm not sure what's going on with the allocations / performance:

julia> using BenchmarkTools, PlayingCards, PokerHandEvaluator

julia> using StatsBase: sample

julia> const fd = full_deck();

julia> sample_cards(N::Int) = sample_cards(Val(N))
sample_cards (generic function with 1 method)

julia> sample_cards(::Val{N}) where {N} = @view fd[sample(1:52, N; replace=false)]
sample_cards (generic function with 2 methods)

julia> @btime CompactHandEval(sample_cards(7))
  132.720 ms (424272 allocations: 26.70 MiB)
CompactHandEval{PokerHandEvaluator.HandTypes.TwoPair}(PokerHandEvaluator.HandTypes.TwoPair(), 2732)

julia> @btime CompactHandEval(sample_cards(7))
  21.955 ms (33420 allocations: 2.33 MiB)
CompactHandEval{PokerHandEvaluator.HandTypes.OnePair}(PokerHandEvaluator.HandTypes.OnePair(), 5539)

julia> @btime CompactHandEval(sample_cards(7))
  2.923 ms (3525 allocations: 203.67 KiB)
CompactHandEval{PokerHandEvaluator.HandTypes.TwoPair}(PokerHandEvaluator.HandTypes.TwoPair(), 3051)

julia> @btime CompactHandEval(sample_cards(7))
  59.600 μs (107 allocations: 2.28 KiB)
CompactHandEval{PokerHandEvaluator.HandTypes.TwoPair}(PokerHandEvaluator.HandTypes.TwoPair(), 2942)

Here is a sample of the allocation sizes when we avoid creating the large tuple of combinations:

# main
@ballocated CompactHandEval(sample_cards(5)) # 288
@ballocated CompactHandEval(sample_cards(6)) # 976
@ballocated CompactHandEval(sample_cards(7)) # 9200
@ballocated FullHandEval(sample_cards(5)) # 816
@ballocated FullHandEval(sample_cards(6)) # 1920
@ballocated FullHandEval(sample_cards(7)) # 15680

# this PR
@ballocated CompactHandEval(sample_cards(5)) # 288
@ballocated CompactHandEval(sample_cards(6)) # 608
@ballocated CompactHandEval(sample_cards(7)) # 1728
@ballocated FullHandEval(sample_cards(5)) # 816
@ballocated FullHandEval(sample_cards(6)) # 944
@ballocated FullHandEval(sample_cards(7)) # 2384

By the last run it seems like we're roughly 3x slower than what I would naively expect (~21*2 μs), and I'm not sure why the results change so much from run to run. I tried first including the newly added file (precompile), which calls all the core evaluate5 methods, but that didn't seem to help. Maybe the precompile file needs to call the top-level functions (e.g., CompactHandEval/FullHandEval) instead of the core method (evaluate5)?
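
For reference, a minimal sketch of a precompile file that calls the top-level constructors instead of evaluate5 (the 5/6/7-card hands below are arbitrary slices of a full deck, purely for illustration; this is not the actual contents of the PR):

# Sketch: exercise the top-level constructors once so their full call chains
# compile here rather than on first use.
using PlayingCards, PokerHandEvaluator

let deck = full_deck()
    for n in (5, 6, 7)
        hand = Tuple(deck[1:n])   # any n distinct cards will do
        CompactHandEval(hand)
        FullHandEval(hand)
    end
end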

If it's any consolation, interpolating the entire view (so the sampling happens once, before the benchmark runs) seems to measure great performance right from the start:

julia> using StatsBase, BenchmarkTools, PlayingCards, PokerHandEvaluator

julia> const fd = full_deck();

julia> @btime CompactHandEval($(@view fd[sample(1:52, 7; replace=false)]))
  11.400 μs (107 allocations: 1.69 KiB)
CompactHandEval{PokerHandEvaluator.HandTypes.HighCard}(PokerHandEvaluator.HandTypes.HighCard(), 6451)
codecov[bot] commented 3 years ago

Codecov Report

Merging #28 (bfae0a2) into main (d04fc5a) will decrease coverage by 10.00%. The diff coverage is 0.00%.

:exclamation: Current head bfae0a2 differs from pull request most recent head 360b697. Consider uploading reports for the commit 360b697 to get more accurate results.

@@             Coverage Diff              @@
##              main      #28       +/-   ##
============================================
- Coverage   100.00%   90.00%   -10.00%     
============================================
  Files            4        4               
  Lines          235      230        -5     
============================================
- Hits           235      207       -28     
- Misses           0       23       +23     
Impacted Files                        Coverage Δ
src/PokerHandEvaluator.jl             61.53% <0.00%> (-38.47%) :arrow_down:
src/evaluate5.jl                      96.77% <0.00%> (-3.23%) :arrow_down:
src/HandCombinations.jl               97.29% <0.00%> (-2.71%) :arrow_down:
src/best_hand_rank_from_n_cards.jl   100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update d04fc5a...360b697.

Moelf commented 3 years ago

thanks for the super quick PR, I appreciate it!

first of all:

but I'm not sure why it's changing so much each time.

this is because @btime reports the minimum time, so the result is somewhat luck-based; @benchmark shows you the full distribution:

julia> @benchmark CompactHandEval(sample_cards(7))
BenchmarkTools.Trial:
  memory estimate:  16.94 MiB
  allocs estimate:  268010
  --------------
  minimum time:     124.930 ms (0.00% GC)
  median time:      263.639 ms (3.89% GC)
  mean time:        269.180 ms (2.87% GC)
  maximum time:     410.603 ms (2.72% GC)
  --------------
  samples:          20
  evals/sample:     1

I also discovered that we can sample directly into a preallocated buffer by using sample!, which makes things significantly faster:

julia> const fd = full_deck();

julia> const cards_buffer = Vector{Card}(undef, 7)
7-element Vector{Card}:

julia> @benchmark sample!($fd, cards_buffer; replace=false)
BenchmarkTools.Trial:
  memory estimate:  496 bytes
  allocs estimate:  1
  --------------
  minimum time:     192.065 ns (0.00% GC)
  median time:      204.055 ns (0.00% GC)
  mean time:        213.224 ns (1.62% GC)
  maximum time:     1.015 μs (71.41% GC)
  --------------
  samples:          10000
  evals/sample:     600

but yeah, the memory allocation is still staggering:

julia> @benchmark CompactHandEval(sample!($fd, $cards_buffer; replace=false))
BenchmarkTools.Trial:
  memory estimate:  1.29 MiB
  allocs estimate:  18700
  --------------
  minimum time:     20.113 ms (0.00% GC)
  median time:      164.844 ms (0.00% GC)
  mean time:        165.933 ms (2.72% GC)
  maximum time:     305.649 ms (4.25% GC)
  --------------
  samples:          31
  evals/sample:     1
charleskawczynski commented 3 years ago

thanks for the super quick PR, I appreciate it!

:thumbsup:, I'm almost at a point where I'm starting to think about performance in TexasHoldem.jl.

I also discovered that we sample directly by using sample!, which makes things significantly faster:

Nice!

but yeah, the memory allocation is still staggering:

True. A last resort might be to adjust PlayingCards.jl to be more like Stephan's Cards.jl, but I'd prefer to stick with the existing representation if we can make it work. I'm still seeing a decrease in allocations if @benchmark is called multiple times:

julia> @benchmark CompactHandEval(sample!($fd, $cards_buffer; replace=false))
BenchmarkTools.Trial:
  memory estimate:  2.63 MiB
  allocs estimate:  41205
  --------------
  minimum time:     45.936 ms (0.00% GC)
  median time:      118.131 ms (0.00% GC)
  mean time:        119.275 ms (0.47% GC)
  maximum time:     225.018 ms (2.30% GC)
  --------------
  samples:          42
  evals/sample:     1

julia> @benchmark CompactHandEval(sample!($fd, $cards_buffer; replace=false))
BenchmarkTools.Trial:
  memory estimate:  565.00 KiB
  allocs estimate:  8745
  --------------
  minimum time:     2.716 ms (0.00% GC)
  median time:      49.843 ms (0.00% GC)
  mean time:        54.908 ms (0.50% GC)
  maximum time:     161.007 ms (0.00% GC)
  --------------
  samples:          91
  evals/sample:     1

julia> @benchmark CompactHandEval(sample!($fd, $cards_buffer; replace=false))
BenchmarkTools.Trial:
  memory estimate:  2.11 KiB
  allocs estimate:  104
  --------------
  minimum time:     64.300 μs (0.00% GC)
  median time:      12.136 ms (0.00% GC)
  mean time:        19.564 ms (0.56% GC)
  maximum time:     97.030 ms (0.00% GC)
  --------------
  samples:          255
  evals/sample:     1

I don't really understand why, though.

I've even seen performance as good as:

julia> @benchmark CompactHandEval(sample!($fd, $cards_buffer; replace=false))
BenchmarkTools.Trial:
  memory estimate:  1.72 KiB
  allocs estimate:  79
  --------------
  minimum time:     30.300 μs (0.00% GC)
  median time:      52.900 μs (0.00% GC)
  mean time:        123.682 μs (0.50% GC)
  maximum time:     68.514 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

this seems much better. Is 1.72 KiB reasonable?

charleskawczynski commented 3 years ago

Going to continue with this soon because TexasHoldem.jl is about ready for performance profiling and, IIRC, the first time I profiled it the hotspots were hand evaluations and creating the table (which allocates a full deck).

charleskawczynski commented 3 years ago

I think we need to fix precompilation before addressing this, since compilation seems to be pretty dramatically affecting the timings.

charleskawczynski commented 3 years ago

using StatsBase, BenchmarkTools, PlayingCards, PokerHandEvaluator
const fd = full_deck();
const cards_buffer = Vector{Card}(undef, 7);
@benchmark sample!($fd, cards_buffer; replace=false)
@benchmark CompactHandEval(sample!($fd, $cards_buffer; replace=false))

charleskawczynski commented 3 years ago

After updating to the new PlayingCards, I'm seeing (it looks prettier in the REPL with color-coding):

julia> using StatsBase, BenchmarkTools, PlayingCards, PokerHandEvaluator

julia> const fd = full_deck();

julia> const cards_buffer = Vector{Card}(undef, 7);

julia> @benchmark sample!($fd, cards_buffer; replace=false)
BenchmarkTools.Trial: 10000 samples with 905 evaluations.
 Range (min … max):  131.996 ns … 602.459 ns  ┊ GC (min … max): 0.00% … 61.41%
 Time  (median):     137.305 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   145.675 ns ±  30.364 ns  ┊ GC (mean ± σ):  1.58% ±  6.06%

   █▇▃▁     ▁  ▁▂▁▂                                             ▁
  ██████▇██▇███████████▇▆▅▇█▇█▇▇▇▇▇▆▆▆▅▆▆▆▆▅▅▄▄▄▄▄▄▅▄▃▄▄▄▁▁▁▁▁▄ █
  132 ns        Histogram: log(frequency) by time        301 ns <

 Memory estimate: 496 bytes, allocs estimate: 1.

julia> @benchmark CompactHandEval(sample!($fd, $cards_buffer; replace=false))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.139 μs … 579.861 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     27.190 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   29.879 μs ±  16.970 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▃▄▅▆▇█████▇▇▆▅▅▄▃▃▂▂▁▁▁▁▁                                ▃
  ▄▆████████████████████████████▇█▇▇████▇██▇▇▆▇▇▇▇▆▆▆▆▆▆▇▆▅▄▅▆ █
  14.1 μs       Histogram: log(frequency) by time        78 μs <

 Memory estimate: 640 bytes, allocs estimate: 10.

So, the allocations are way better. Performance seems okay and, fortunately, repeating @benchmark CompactHandEval(sample!($fd, $cards_buffer; replace=false)) doesn't seem to significantly change the results as it did before.

charleskawczynski commented 3 years ago

On main (without the view interface), we have:

julia> @benchmark CompactHandEval(Tuple(sample!($fd, $cards_buffer; replace=false)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.208 μs … 243.429 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.165 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.308 μs ±   7.809 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▃▅▇▇██▇▅▂                                        
  ▂▂▁▂▂▂▂▂▃▄▅▆███████████▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  11.2 μs         Histogram: frequency by time         61.7 μs <

 Memory estimate: 784 bytes, allocs estimate: 19.

This seems to have slightly more allocations compared to the view interface, but the runtime is about the same. Maybe this whole issue was simply related to the PlayingCards memory layout (which is now much more compact) and type stability (we're now type-stable with the new PlayingCards)? I guess it wouldn't hurt to add view support, but the tuple interface seems to work fine. @Moelf, are there any other benefits to supporting views?
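
For reference, one quick way to eyeball the type-stability point in the REPL (just a sketch; fd and cards_buffer are as defined above):

# Sketch: inspect inference for the tuple interface; red Any/Union entries in
# the output would point at an instability.
using InteractiveUtils   # provides @code_warntype (loaded by default in the REPL)
hand = Tuple(sample!(fd, cards_buffer; replace=false))
@code_warntype CompactHandEval(hand)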

charleskawczynski commented 3 years ago

I think that, between #33, #35, and #36, we don't need this anymore. We can reopen if there's a good reason, but performance seems to be way better now, and I don't think the view will help much since each card now has a much smaller memory footprint.
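
As a quick sanity check of that last point, the per-card footprint can be inspected directly (a sketch; only full_deck comes from this thread, the rest is a generic Base query):

# Sketch: bytes per Card under the new, more compact PlayingCards layout.
using PlayingCards
sizeof(eltype(full_deck()))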