PeteBassettBet365 commented 2 years ago

This pull request covers some changes to improve performance in the tdsBuffer struct and ucs22str function.

buf.go The tdsBuffer uint16 function is used heavily throughout the tds parsing code. The original implementation created a byte array and a slice into it to be filled by a call to ReadFull. This calls through io.ReadFull back into tdsBuffer io.Reader Read function. This code path incurs significant overhead when the buffer already has at least 2 bytes available. The new implementation checks the read position before calling ReadFull. If we have at least two bytes available we read directly, otherwise fall back to the slow path. uint32 and uint64 have also been changed in the same way for 4 & 8 byte reads.

uint16 - a loop calling uint16 10000 times BenchmarkReadUint16Implementations/reference-12 4010 304633 ns/op 20045 B/op 10001 allocs/op BenchmarkReadUint16Implementations/new_version-12 23412 50691 ns/op 32 B/op 1 allocs/op

uint32 - a loop calling uint32 5000 times BenchmarkReadUint32Implementations/reference-12 6684 154147 ns/op 20045 B/op 5001 allocs/op BenchmarkReadUint32Implementations/new_version-12 48321 25327 ns/op 32 B/op 1 allocs/op

uint64 - a loop calling uint64 2500 times BenchmarkReadUint64Implementations/reference-12 3795 87688 ns/op 20045 B/op 2501 allocs/op BenchmarkReadUint64Implementations/new_version-12 85941 14582 ns/op 32 B/op 1 allocs/op

tds.go - performance improvement on ucs22str function. It is responsible for reading 16bit ucs2 characters and producing a string. Most of the strings we deal with are ascii so the full unicode handling path can be bypassed when they are found. We use unsafe to read the incoming bytes in 8 byte chunks, as uint64s, masking the value to find invalid ascii. If found we fall back to a faster version of the original implementation to deal with the unicode. Otherwise we have 8 bytes and 4 can be safely added to the output buffer. If we reach the end of the input successfully the buffer can be converted directly to a string. This reduces allocations on the ascii path and is a significant speed improvement, up to 16x in our tests. The faster unicode path avoids copying the byte data to a newly allocated uint16 array. It again uses unsafe, to create a uint16 slice to the original byte data. This is safe because it is passed to directly utf16.Decode, which itself creates a copy as a []rune. This saves us an allocation, copying and uint16 conversion.

Reference Implementation BenchmarkUcs22strReferenceAscii-12 12274105 98.6 ns/op 32 B/op 3 allocs/op BenchmarkUcs22strReferenceMediumAscii-12 6721837 178 ns/op 96 B/op 3 allocs/op BenchmarkUcs22strReferenceLongAscii-12 1538629 778 ns/op 448 B/op 3 allocs/op BenchmarkUcs22strReferenceLongerAscii-12 169466 7115 ns/op 4608 B/op 3 allocs/op BenchmarkUcs22strReferenceTrailingUnicode-12 4423173 274 ns/op 160 B/op 3 allocs/op BenchmarkUcs22strReferenceLongEmojis-12 3166297 380 ns/op 304 B/op 3 allocs/op

New Implementation BenchmarkUcs22strAscii-12 54690384 21.9 ns/op 3 B/op 1 allocs/op BenchmarkUcs22strMediumAscii-12 35389251 33.4 ns/op 16 B/op 1 allocs/op BenchmarkUcs22strLongAscii-12 17188800 70.1 ns/op 64 B/op 1 allocs/op BenchmarkUcs22strLongerAscii-12 2864568 427 ns/op 640 B/op 1 allocs/op BenchmarkUcs22strTrailingUnicode-12 4221280 259 ns/op 144 B/op 3 allocs/op BenchmarkUcs22strLongEmojis-12 3549286 341 ns/op 272 B/op 3 allocs/op

buf.go A tdsBuffer is created for each sql connection and internally holds two []byte buffers, one for reading and one for writing. They are allocated directly with make. Each time a connection is closed, they are garbage collected. We added a sync.Pool managed pool of 64k []byte slices. These are created as required and placed back in the pool when the tdsBuffer is closed. Each pooled slice is used for the read and write buffers by reslicing half for each. Connections are generally long lived so this is a minor improvement to garbage collection overhead at the expense of slightly increased memory usage. Happy to discuss this further if you deem it not required.

All benchmarks done on an 8core i7, 16gb

codecov[bot] commented 2 years ago

Codecov Report

Merging #717 (05e7d0d) into master (e538731) will increase coverage by 0.48%. The diff coverage is 97.93%.

@@            Coverage Diff             @@
##           master     #717      +/-   ##
==========================================
+ Coverage   71.29%   71.78%   +0.48%     
==========================================
  Files          24       24              
  Lines        5386     5511     +125     
==========================================
+ Hits         3840     3956     +116     
- Misses       1301     1308       +7     
- Partials      245      247       +2

Impacted Files	Coverage Δ
buf.go	`98.45% <93.02%> (-1.55%)`	:arrow_down:
mssql.go	`87.76% <100.00%> (+0.01%)`	:arrow_up:
tds.go	`68.76% <100.00%> (+2.90%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e538731...05e7d0d. Read the comment docs.

moredure commented 2 years ago

Also bufpool.Put will allocate byte slice header escapes to the heap...

PeteBassettBet365 commented 2 years ago

Also bufpool.Put will allocate byte slice escapes to the heap...

Hi. The original slices in the tdsBuffer were also on the heap, but they were garbage collected, we now reuse them due to the pool.

Is there any other feedback about this PR?

denisenkom / go-mssqldb

Performance Improvement in tds and ucs2 parsing #717

Codecov Report