CoreyKaylor / Lightning.NET

.NET library for LMDB key-value store
Other
397 stars 82 forks source link

Add span overloads to API #122

Closed AlgorithmsAreCool closed 4 years ago

AlgorithmsAreCool commented 4 years ago

This PR closes #116. It surfaces Span overloads for most operations and should improve both read and write performance primarily by reducing GC allocations

AlgorithmsAreCool commented 4 years ago

Before :


BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]     : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
  DefaultJob : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
Method OpsPerTransaction ValueSize KeyOrder Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Read 1 8 Sequential 424.5 ns 8.15 ns 7.62 ns 0.0710 - - 336 B
Read 1 64 Sequential 429.0 ns 8.51 ns 13.00 ns 0.0830 - - 392 B
Read 1 256 Sequential 441.4 ns 7.69 ns 7.19 ns 0.1240 - - 584 B
Read 100 8 Sequential 12,507.1 ns 244.10 ns 372.77 ns 0.7324 - - 3504 B
Read 100 64 Sequential 15,080.6 ns 288.58 ns 364.96 ns 1.9226 - - 9104 B
Read 100 256 Sequential 16,760.2 ns 331.27 ns 515.74 ns 6.0120 - - 28304 B
Read 1000 8 Sequential 173,016.9 ns 3,397.00 ns 8,332.91 ns 6.8359 - - 32304 B
Read 1000 64 Sequential 179,190.1 ns 3,554.55 ns 5,428.18 ns 18.5547 - - 88304 B
Read 1000 256 Sequential 201,845.5 ns 3,963.95 ns 6,512.88 ns 59.5703 - - 280304 B

After


BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]     : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
  DefaultJob : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
Method OpsPerTransaction ValueSize KeyOrder Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Read 1 8 Sequential 426.7 ns 8.46 ns 11.00 ns 0.0710 - - 336 B
Read 1 64 Sequential 437.0 ns 8.76 ns 15.79 ns 0.0830 - - 392 B
Read 1 256 Sequential 445.2 ns 8.53 ns 10.79 ns 0.1240 - - 584 B
Read 100 8 Sequential 12,377.9 ns 236.80 ns 324.14 ns 0.7324 - - 3504 B
Read 100 64 Sequential 14,276.2 ns 281.75 ns 438.66 ns 1.9226 - - 9104 B
Read 100 256 Sequential 16,474.7 ns 324.40 ns 485.55 ns 6.0120 - - 28304 B
Read 1000 8 Sequential 165,299.1 ns 3,158.93 ns 3,243.99 ns 6.8359 - - 32304 B
Read 1000 64 Sequential 179,787.0 ns 3,389.49 ns 3,170.53 ns 18.5547 - - 88305 B
Read 1000 256 Sequential 220,032.9 ns 5,844.55 ns 17,048.84 ns 59.5703 - - 280304 B

After (with new non allocating API)


BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2)
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.5.20279.10
  [Host]     : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
  DefaultJob : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT
Method OpsPerTransaction ValueSize KeyOrder Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Read 1 8 Sequential 449.6 ns 8.88 ns 14.83 ns 0.0644 - - 304 B
Read 1 64 Sequential 435.0 ns 8.72 ns 21.88 ns 0.0644 - - 304 B
Read 1 256 Sequential 415.3 ns 7.60 ns 8.44 ns 0.0644 - - 304 B
Read 100 8 Sequential 12,363.8 ns 240.95 ns 337.78 ns 0.0610 - - 304 B
Read 100 64 Sequential 13,812.4 ns 272.65 ns 505.38 ns 0.0610 - - 304 B
Read 100 256 Sequential 14,835.0 ns 293.48 ns 521.66 ns 0.0610 - - 304 B
Read 1000 8 Sequential 164,105.6 ns 3,256.16 ns 5,529.21 ns - - - 304 B
Read 1000 64 Sequential 172,565.8 ns 3,365.86 ns 5,715.49 ns - - - 304 B
Read 1000 256 Sequential 173,051.3 ns 3,410.59 ns 7,557.63 ns - - - 304 B
AlgorithmsAreCool commented 4 years ago

I need to do another pass over the documentation and do the write benchmarks

I'll take care of those tomorrow, it's late

CoreyKaylor commented 4 years ago

Seriously, well done.

AlgorithmsAreCool commented 4 years ago

I've spent some time today seeing if there was any low hanging fruit in terms of perf:

In the table below:

The good news is that at peak, we are hitting 5M+ reads per second depending on batch and value sizes. This translates to read throughput of over 1.2GB/s. Not too shabby. Also, the overhead of copying the value into an array is pretty low, this hurts my case for #121

The not so great is that the overhead of the transaction is felt more than I expected. The common pattern of single reads per transaction is giving up about half of it's performance to library overhead. I'm a little shocked by this result because the overhead was only about 15% or so when using sequential keys and a smaller database. This test uses random keys and 256K entries in the DB instead of just 1000.

But even the "worst case" is still ~2M reads per second. That ain't bad at all. Also, even though we are allocating 300 bytes per read, they are all clustered in Gen0 so the GC not incurring much pain from it.

I'll probably create another issue to poke at it some more after I address feedback and remove those warnings i forgot about in this one.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.836 (1909/November2018Update/19H2) Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores .NET Core SDK=5.0.100-preview.5.20279.10 [Host] : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT DefaultJob : .NET Core 3.1.5 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.27001), X64 RyuJIT

Method OpsPerTransaction DatabaseSize ValueSize KeyOrder Mean Error StdDev Throughput Memory Throughput Median Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Read 1 262144 8 Random 513.3 ns 9.67 ns 14.17 ns 1,948,162.52 / s 15,585,300.18 MB/s 513.3 ns 2.05 0.07 0.0639 - - 304 B
ReadNoCopy 1 262144 8 Random 497.6 ns 9.95 ns 18.93 ns 2,009,765.73 / s 16,078,125.87 MB/s 491.2 ns 1.98 0.07 0.0639 - - 304 B
ReadMinimal 1 262144 8 Random 250.9 ns 5.05 ns 7.71 ns 3,985,211.15 / s 31,881,689.24 MB/s 251.1 ns 1.00 0.00 - - - -
Read 1 262144 64 Random 492.7 ns 9.69 ns 13.59 ns 2,029,661.31 / s 129,898,323.92 MB/s 488.3 ns 2.08 0.07 0.0639 - - 304 B
ReadNoCopy 1 262144 64 Random 492.4 ns 9.53 ns 13.35 ns 2,030,757.10 / s 129,968,454.16 MB/s 495.6 ns 2.08 0.08 0.0639 - - 304 B
ReadMinimal 1 262144 64 Random 237.8 ns 4.77 ns 6.36 ns 4,205,150.29 / s 269,129,618.80 MB/s 237.8 ns 1.00 0.00 - - - -
Read 1 262144 256 Random 498.5 ns 9.97 ns 14.61 ns 2,005,850.63 / s 513,497,761.70 MB/s 503.6 ns 2.04 0.10 0.0639 - - 304 B
ReadNoCopy 1 262144 256 Random 494.2 ns 9.80 ns 16.63 ns 2,023,338.34 / s 517,974,614.30 MB/s 495.7 ns 2.00 0.13 0.0644 - - 304 B
ReadMinimal 1 262144 256 Random 252.6 ns 5.03 ns 12.24 ns 3,958,301.20 / s 1,013,325,108.30 MB/s 254.3 ns 1.00 0.00 - - - -
Read 100 262144 8 Random 19,900.9 ns 389.37 ns 594.61 ns 5,024,900.04 / s 40,199,200.35 MB/s 20,074.8 ns 1.26 0.04 0.0610 - - 304 B
ReadNoCopy 100 262144 8 Random 19,488.5 ns 377.90 ns 912.68 ns 5,131,219.32 / s 41,049,754.55 MB/s 19,426.4 ns 1.20 0.07 0.0610 - - 304 B
ReadMinimal 100 262144 8 Random 15,874.3 ns 298.27 ns 319.14 ns 6,299,496.24 / s 50,395,969.92 MB/s 15,858.7 ns 1.00 0.00 - - - -
Read 100 262144 64 Random 20,338.6 ns 402.07 ns 774.65 ns 4,916,754.83 / s 314,672,309.30 MB/s 20,428.5 ns 1.18 0.05 0.0610 - - 304 B
ReadNoCopy 100 262144 64 Random 18,794.6 ns 373.90 ns 445.10 ns 5,320,667.67 / s 340,522,731.20 MB/s 18,747.5 ns 1.10 0.05 0.0610 - - 304 B
ReadMinimal 100 262144 64 Random 17,299.3 ns 344.10 ns 584.31 ns 5,780,576.01 / s 369,956,864.43 MB/s 17,583.2 ns 1.00 0.00 - - - -
Read 100 262144 256 Random 21,189.2 ns 417.04 ns 390.10 ns 4,719,374.46 / s 1,208,159,862.97 MB/s 21,199.8 ns 1.20 0.03 0.0610 - - 304 B
ReadNoCopy 100 262144 256 Random 20,382.1 ns 402.55 ns 523.43 ns 4,906,272.15 / s 1,256,005,670.03 MB/s 20,381.3 ns 1.15 0.03 0.0610 - - 304 B
ReadMinimal 100 262144 256 Random 17,678.7 ns 350.80 ns 344.53 ns 5,656,513.21 / s 1,448,067,381.00 MB/s 17,541.7 ns 1.00 0.00 - - - -
Read 1000 262144 8 Random 271,680.7 ns 5,404.78 ns 8,880.21 ns 3,680,791.47 / s 29,446,331.78 MB/s 270,716.9 ns 1.09 0.04 - - - 304 B
ReadNoCopy 1000 262144 8 Random 257,954.2 ns 4,940.35 ns 5,073.38 ns 3,876,657.19 / s 31,013,257.49 MB/s 257,343.4 ns 1.06 0.04 - - - 304 B
ReadMinimal 1000 262144 8 Random 248,480.0 ns 4,924.56 ns 8,227.83 ns 4,024,468.93 / s 32,195,751.46 MB/s 247,993.7 ns 1.00 0.00 - - - -
Read 1000 262144 64 Random 279,058.9 ns 5,554.03 ns 7,024.07 ns 3,583,473.35 / s 229,342,294.41 MB/s 281,347.1 ns 1.06 0.04 - - - 304 B
ReadNoCopy 1000 262144 64 Random 271,230.2 ns 5,180.11 ns 4,845.48 ns 3,686,904.78 / s 235,961,906.02 MB/s 272,830.3 ns 1.03 0.02 - - - 304 B
ReadMinimal 1000 262144 64 Random 263,690.2 ns 5,026.03 ns 4,701.35 ns 3,792,328.96 / s 242,709,053.15 MB/s 265,527.0 ns 1.00 0.00 - - - -
Read 1000 262144 256 Random 300,485.7 ns 5,819.14 ns 5,975.82 ns 3,327,944.95 / s 851,953,908.31 MB/s 302,726.5 ns 1.09 0.04 - - - 304 B
ReadNoCopy 1000 262144 256 Random 284,745.1 ns 5,492.79 ns 6,105.23 ns 3,511,912.41 / s 899,049,576.07 MB/s 287,330.3 ns 1.03 0.02 - - - 304 B
ReadMinimal 1000 262144 256 Random 276,339.5 ns 5,475.50 ns 5,858.72 ns 3,618,736.54 / s 926,396,553.86 MB/s 278,851.8 ns 1.00 0.00 - - - 1 B
Read 10000 262144 8 Random 3,156,183.7 ns 28,290.50 ns 26,462.95 ns 3,168,383.35 / s 25,347,066.80 MB/s 3,156,604.1 ns 1.09 0.01 - - - 306 B
ReadNoCopy 10000 262144 8 Random 3,082,680.0 ns 26,803.50 ns 25,072.01 ns 3,243,930.62 / s 25,951,444.96 MB/s 3,075,674.0 ns 1.07 0.01 - - - 304 B
ReadMinimal 10000 262144 8 Random 2,886,256.1 ns 28,983.67 ns 24,202.68 ns 3,464,696.05 / s 27,717,568.39 MB/s 2,885,249.2 ns 1.00 0.00 - - - 5 B
Read 10000 262144 64 Random 3,978,827.5 ns 29,298.17 ns 27,405.53 ns 2,513,303.23 / s 160,851,406.60 MB/s 3,974,537.5 ns 1.06 0.01 - - - 314 B
ReadNoCopy 10000 262144 64 Random 3,869,549.2 ns 24,773.98 ns 23,173.60 ns 2,584,280.35 / s 165,393,942.24 MB/s 3,865,378.9 ns 1.03 0.01 - - - 306 B
ReadMinimal 10000 262144 64 Random 3,761,864.1 ns 26,050.39 ns 21,753.25 ns 2,658,256.56 / s 170,128,420.15 MB/s 3,758,513.3 ns 1.00 0.00 - - - 5 B
Read 10000 262144 256 Random 4,671,714.2 ns 35,431.55 ns 31,409.15 ns 2,140,541.91 / s 547,978,729.99 MB/s 4,671,530.5 ns 1.10 0.02 - - - 304 B
ReadNoCopy 10000 262144 256 Random 4,362,596.4 ns 27,258.76 ns 24,164.19 ns 2,292,212.97 / s 586,806,520.99 MB/s 4,352,195.3 ns 1.03 0.01 - - - 310 B
ReadMinimal 10000 262144 256 Random 4,231,405.8 ns 45,450.64 ns 40,290.81 ns 2,363,280.78 / s 604,999,879.20 MB/s 4,225,128.1 ns 1.00 0.00 - - - 37 B
CoreyKaylor commented 4 years ago

This is all good news. Sounds like there might be more to "squeeze" out, but we're in a much better place with these changes than we were. Looking forward to getting these changes in. Thanks for all the effort here.

CoreyKaylor commented 4 years ago

I'm planning on pulling this in tomorrow unless you have any objections?

AlgorithmsAreCool commented 4 years ago

Yeah, I don't object. I can land anything else in another PR. Sorry about the delay here

CoreyKaylor commented 4 years ago

Added you as a collaborator if you're interested. All I ask is bigger changes get discussed first. Regarding time, I get it, nothing to be sorry for.