LeelaChessZero / lc0

The rewritten engine, originally for TensorFlow. Now all other backends have been ported here.
GNU General Public License v3.0

Persistent L2 cache opt for cuda backend #1815

Closed ankan-ban closed 1 year ago

ankan-ban commented 1 year ago

With the 807647 network on an RTX 4090:

Before:

Benchmark batch size 32 with inference average time 1.58066ms - throughput 20244.6 nps.
Benchmark batch size 64 with inference average time 2.01741ms - throughput 31723.8 nps.
Benchmark batch size 96 with inference average time 2.32485ms - throughput 41293 nps.
Benchmark batch size 128 with inference average time 2.94163ms - throughput 43513.3 nps.
Benchmark batch size 160 with inference average time 3.54371ms - throughput 45150.4 nps.
Benchmark batch size 192 with inference average time 4.34863ms - throughput 44151.8 nps.
Benchmark batch size 224 with inference average time 4.87263ms - throughput 45971.1 nps.
Benchmark batch size 256 with inference average time 5.96631ms - throughput 42907.6 nps.

After:

Benchmark batch size 32 with inference average time 1.51683ms - throughput 21096.6 nps.
Benchmark batch size 64 with inference average time 2.08826ms - throughput 30647.5 nps.
Benchmark batch size 96 with inference average time 2.08941ms - throughput 45945.9 nps.
Benchmark batch size 128 with inference average time 2.75917ms - throughput 46390.8 nps.
Benchmark batch size 160 with inference average time 3.32101ms - throughput 48178.1 nps.
Benchmark batch size 192 with inference average time 4.02509ms - throughput 47700.8 nps.
Benchmark batch size 224 with inference average time 4.85234ms - throughput 46163.3 nps.
Benchmark batch size 256 with inference average time 5.94875ms - throughput 43034.3 nps.
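
For reference, the persistent L2 cache feature this PR targets is exposed in CUDA 11+ on compute capability 8.0 and newer GPUs through a per-stream access policy window. The sketch below is not the PR's actual code; it is only a minimal illustration of the mechanism, with `weights_ptr` and `weights_bytes` standing in for whatever buffer the backend chooses to keep resident in L2:

```cpp
#include <algorithm>
#include <cuda_runtime.h>

// Mark a frequently reused buffer (e.g. network weights) as "persisting" in L2
// so it is preferentially kept resident across kernel launches on `stream`.
void EnablePersistentL2(cudaStream_t stream, void* weights_ptr, size_t weights_bytes) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  if (prop.persistingL2CacheMaxSize == 0) return;  // feature not available on this GPU

  // Reserve part of L2 for persisting accesses, capped by the device limit.
  size_t carve_out =
      std::min<size_t>(weights_bytes, prop.persistingL2CacheMaxSize);
  cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carve_out);

  // Describe the address window whose accesses should use the persisting region.
  cudaStreamAttrValue attr = {};
  attr.accessPolicyWindow.base_ptr = weights_ptr;
  attr.accessPolicyWindow.num_bytes =
      std::min<size_t>(weights_bytes, prop.accessPolicyMaxWindowSize);
  attr.accessPolicyWindow.hitRatio = 1.0f;  // treat all accesses in the window as persisting
  attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
  attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
  cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```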

borg323 commented 1 year ago

We may have to revise the default for cache_opt, but for now true is fine for wider testing.
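
As a purely illustrative follow-up to the default question (nothing like this is claimed to exist in the PR), one option would be to derive the default for a cache_opt-style flag from the device at backend initialization, enabling it only when the GPU exposes a persisting L2 carve-out large enough for the buffer in question:

```cpp
#include <cuda_runtime.h>

// Hypothetical helper (not from the PR): choose a default for a cache_opt-style
// flag by checking whether the device's persisting L2 region can hold the
// buffer we want to keep resident.
bool DefaultCacheOpt(int device_id, size_t resident_bytes) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess) return false;
  return static_cast<size_t>(prop.persistingL2CacheMaxSize) >= resident_bytes;
}
```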