The goal is to fit the activations of a residual block in the L2 cache.
Gives around a 6.7% throughput improvement for T80 networks.
Expected to help only Ada GPUs, which have much larger L2 caches than previous generations.
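A rough back-of-envelope check of why this is Ada-specific. The numbers below are assumptions for illustration, not taken from the lc0 source: a T80-style network with 512 filters on an 8x8 board in fp16, a 72 MiB L2 for the RTX 4090 (Ada), and a 6 MiB L2 for an Ampere part like the RTX 3090.

```python
# Hedged sketch: estimate whether one residual-block activation tensor fits
# in L2. Channel count, element size, and L2 capacities are assumptions.

def activation_bytes(batch, channels=512, board=8 * 8, elem_size=2):
    """Bytes for one NCHW activation tensor (fp16 by default)."""
    return batch * channels * board * elem_size

L2_ADA_4090 = 72 * 1024 * 1024    # RTX 4090 (Ada) L2, assumed 72 MiB
L2_AMPERE_3090 = 6 * 1024 * 1024  # RTX 3090 (Ampere) L2, assumed 6 MiB

for batch in (32, 96, 160, 256):
    size = activation_bytes(batch)
    print(f"batch {batch}: {size // (1024 * 1024)} MiB -",
          "fits Ada L2" if size <= L2_ADA_4090 else "spills on Ada",
          "/ fits Ampere L2" if size <= L2_AMPERE_3090 else "/ spills on Ampere")
```

Under these assumptions a batch-160 activation tensor is about 10 MiB, comfortably inside a 72 MiB Ada L2 but well past a 6 MiB Ampere L2, which matches the expectation that only Ada benefits.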
With 807647 network on RTX 4090 -
Before:
Benchmark batch size 32 with inference average time 1.58066ms - throughput 20244.6 nps.
Benchmark batch size 64 with inference average time 2.01741ms - throughput 31723.8 nps.
Benchmark batch size 96 with inference average time 2.32485ms - throughput 41293 nps.
Benchmark batch size 128 with inference average time 2.94163ms - throughput 43513.3 nps.
Benchmark batch size 160 with inference average time 3.54371ms - throughput 45150.4 nps.
Benchmark batch size 192 with inference average time 4.34863ms - throughput 44151.8 nps.
Benchmark batch size 224 with inference average time 4.87263ms - throughput 45971.1 nps.
Benchmark batch size 256 with inference average time 5.96631ms - throughput 42907.6 nps.
After:
Benchmark batch size 32 with inference average time 1.51683ms - throughput 21096.6 nps.
Benchmark batch size 64 with inference average time 2.08826ms - throughput 30647.5 nps.
Benchmark batch size 96 with inference average time 2.08941ms - throughput 45945.9 nps.
Benchmark batch size 128 with inference average time 2.75917ms - throughput 46390.8 nps.
Benchmark batch size 160 with inference average time 3.32101ms - throughput 48178.1 nps.
Benchmark batch size 192 with inference average time 4.02509ms - throughput 47700.8 nps.
Benchmark batch size 224 with inference average time 4.85234ms - throughput 46163.3 nps.
Benchmark batch size 256 with inference average time 5.94875ms - throughput 43034.3 nps.