The goal is to fit the activations of a residual block in the L2 cache.
Gives around a 6.7% throughput improvement for T80 networks.
Expected to help only Ada GPUs, which have much larger L2 caches than previous generations.
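A rough back-of-envelope check of why this is Ada-specific. The numbers below are assumptions for illustration, not taken from the lc0 source: a T80-style network with 512 filters on an 8x8 board in fp16, a 72 MiB L2 for the RTX 4090 (Ada), and a 6 MiB L2 for an Ampere part like the RTX 3090.

```python
# Hedged sketch: estimate whether one residual-block activation tensor fits
# in L2. Channel count, element size, and L2 capacities are assumptions.

def activation_bytes(batch, channels=512, board=8 * 8, elem_size=2):
    """Bytes for one NCHW activation tensor (fp16 by default)."""
    return batch * channels * board * elem_size

L2_ADA_4090 = 72 * 1024 * 1024    # RTX 4090 (Ada) L2, assumed 72 MiB
L2_AMPERE_3090 = 6 * 1024 * 1024  # RTX 3090 (Ampere) L2, assumed 6 MiB

for batch in (32, 96, 160, 256):
    size = activation_bytes(batch)
    print(f"batch {batch}: {size // (1024 * 1024)} MiB -",
          "fits Ada L2" if size <= L2_ADA_4090 else "spills on Ada",
          "/ fits Ampere L2" if size <= L2_AMPERE_3090 else "/ spills on Ampere")
```

Under these assumptions a batch-160 activation tensor is about 10 MiB, comfortably inside a 72 MiB Ada L2 but well past a 6 MiB Ampere L2, which matches the expectation that only Ada benefits.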
With 807647 network on RTX 4090 -
Before:
Benchmark batch size 32 with inference average time 1.58066ms - throughput 20244.6 nps.
Benchmark batch size 64 with inference average time 2.01741ms - throughput 31723.8 nps.
Benchmark batch size 96 with inference average time 2.32485ms - throughput 41293 nps.
Benchmark batch size 128 with inference average time 2.94163ms - throughput 43513.3 nps.
Benchmark batch size 160 with inference average time 3.54371ms - throughput 45150.4 nps.
Benchmark batch size 192 with inference average time 4.34863ms - throughput 44151.8 nps.
Benchmark batch size 224 with inference average time 4.87263ms - throughput 45971.1 nps.
Benchmark batch size 256 with inference average time 5.96631ms - throughput 42907.6 nps.
After:
Benchmark batch size 32 with inference average time 1.51683ms - throughput 21096.6 nps.
Benchmark batch size 64 with inference average time 2.08826ms - throughput 30647.5 nps.
Benchmark batch size 96 with inference average time 2.08941ms - throughput 45945.9 nps.
Benchmark batch size 128 with inference average time 2.75917ms - throughput 46390.8 nps.
Benchmark batch size 160 with inference average time 3.32101ms - throughput 48178.1 nps.
Benchmark batch size 192 with inference average time 4.02509ms - throughput 47700.8 nps.
Benchmark batch size 224 with inference average time 4.85234ms - throughput 46163.3 nps.
Benchmark batch size 256 with inference average time 5.94875ms - throughput 43034.3 nps.