# Buffers

Buffers are a crucial backbone to networking. From `java.nio.ByteBuffer` to `io.netty.buffer.ByteBuf` with its many options, along with a potential custom implementation, there are many to choose from. It's crucial we select the best-performing one for each occasion.
## Candidates
Let's explore the potential candidates for buffers, showing their strengths and weaknesses, and where each might, in theory, best be used.
> [!TIP]
> Contrary to popular belief, there are no performance benefits to using direct buffers over heap buffers if native operations are not involved - meaning reads and writes within the JVM will yield no increased performance.
### Direct Buffers
Direct buffers have the strength of being optimal for I/O operations, as they can be used by native I/O directly, bypassing any mechanisms where the buffer needs to be copied over. Heap buffers do not get this benefit, and must always be copied over to a direct byte buffer before native I/O can occur.
The downside of direct buffers, however, is that they are more costly to allocate. Every allocated direct buffer first requires all of its bytes to be set to 0. This is often not ideal, as we would be overwriting those bytes ourselves regardless. As noted in the tip above, direct buffers have no distinct advantages when not used natively. The benefits are only received at the native I/O level, or alternatively, when `sun.misc.Unsafe` is used to alter the memory. The main benefits of using Unsafe come from the lack of bounds checking, which - as the name suggests - makes it risky and unsafe to use.
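The zero-fill behaviour is easy to observe with the JDK's own buffers; a minimal sketch using `java.nio.ByteBuffer` (rather than Netty) showing that a fresh direct buffer is always handed to us zeroed:

```kotlin
import java.nio.ByteBuffer

// A freshly allocated direct buffer is zero-filled before we ever touch it -
// work that is wasted if we immediately overwrite the bytes ourselves.
val direct = ByteBuffer.allocateDirect(64)
val allZero = (0 until direct.capacity()).all { direct.get(it) == 0.toByte() }
println(allZero) // true
```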
So, where should direct buffers be used?
The optimal place for direct buffers is within Netty's channel handlers, as the buffers there will be used for I/O directly. There's no reason to use them for regular packet encoding or decoding, as no native operations will be performed on these intermediate buffers, and they are significantly costlier to allocate.
There is, however, a second place where we could explore using direct buffers - internally within our library for the Player and NPC info packets' intermediate buffers. By utilizing `sun.misc.Unsafe` for direct modifications - provided our code can guarantee it stays within bounds for larger slices at a time, rather than the usual byte-by-byte approach - we could gain significant performance improvements. This should be done extremely carefully and should be an opt-in experimental practice, not the default.
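A hypothetical sketch of what such larger-slice Unsafe writes could look like. Note that `sun.misc.Unsafe` is JDK-internal, is obtained reflectively here, and may warn or fail on future JVMs; this is an illustration of the technique, not library code:

```kotlin
import sun.misc.Unsafe

// Obtain the Unsafe singleton reflectively (JDK-internal API).
val unsafe = Unsafe::class.java.getDeclaredField("theUnsafe")
    .apply { isAccessible = true }
    .get(null) as Unsafe

val address = unsafe.allocateMemory(16L)
// Our own code - not the JVM - guarantees [address, address + 16) is in bounds,
// so the individual writes below perform no per-byte bounds checking at all.
unsafe.putLong(address, 0x0123456789ABCDEFL)
unsafe.putLong(address + 8L, -1L)
val readBack = unsafe.getLong(address)
unsafe.freeMemory(address)
println(readBack == 0x0123456789ABCDEFL)
```

This is exactly why the practice must be opt-in: a single miscomputed slice boundary corrupts memory silently instead of throwing an exception.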
### Heap Buffers
Heap buffers have a strength in allocation speed. Compared to direct buffers, they can be as much as 100 times faster to allocate. When compared against direct buffers within the JVM (i.e. reading and writing bytes through the JVM, rather than doing native copies), they perform at a similar rate. It should be noted that CPU caching can play a pivotal role in benchmarks when comparing the two.
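As a rough illustration of the allocation-speed gap, a napkin-style sketch using plain `java.nio.ByteBuffer`; the numbers are noisy and purely illustrative (this is not a JMH benchmark):

```kotlin
import java.nio.ByteBuffer

// Crude wall-clock timing; subject to JIT warm-up and background noise.
fun timeNanos(block: () -> Unit): Long {
    val start = System.nanoTime()
    block()
    return System.nanoTime() - start
}

// Direct allocation additionally zero-fills and reserves native memory,
// which is where its extra cost comes from.
val heapNanos = timeNanos { repeat(10_000) { ByteBuffer.allocate(256) } }
val directNanos = timeNanos { repeat(10_000) { ByteBuffer.allocateDirect(256) } }
println("heap: ${heapNanos}ns, direct: ${directNanos}ns")
```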
So, where should heap buffers be used?
Heap buffers are best used for any operations within the JVM, such as encoding and decoding the packet buffers. They are significantly quicker to allocate, too. In order to reduce garbage, our best strategy is to use pooling for var-length packets, and simply allow the JVM to garbage collect any fixed-size buffers, which are almost universally very small. The idea here is that for those small fixed-size buffers, the pooling mechanism would likely be more costly than just allocating and deallocating them on demand. In the case of var-byte and var-short packets, we would be allocating 256 and 40,000 byte buffers respectively - both large enough to warrant the pooling mechanism.
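A minimal sketch of that split strategy, using a hypothetical `ByteArrayPool` (the name and API are illustrative, not part of the library): the large var-length buffers get pooled, while small fixed-size ones would simply be allocated and left to the GC.

```kotlin
// Hypothetical, non-thread-safe pool for the large var-length buffers.
class ByteArrayPool(private val bufferSize: Int) {
    private val free = ArrayDeque<ByteArray>()

    // Reuse a returned buffer if one is available, else allocate fresh.
    fun take(): ByteArray = free.removeFirstOrNull() ?: ByteArray(bufferSize)

    fun give(buffer: ByteArray) = free.addFirst(buffer)
}

val varShortPool = ByteArrayPool(40_000)
val first = varShortPool.take() // freshly allocated on first use
varShortPool.give(first)
val second = varShortPool.take() // the same instance, reused - no new allocation
println(first === second) // true
```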
### Custom Buffers
Due to the complexity of the Player Info and NPC Info packets, it is very likely we will be utilizing smaller temporary buffers for caching and pre-computations. These buffers will often have a fixed size and require no memory bounds checking. As a result, we may find it more beneficial to use a custom direct or heap buffer, depending on the circumstances. The idea here is that we can bypass any boundary checking that the JVM and/or Netty perform with each byte written. Because CPU caching is especially efficient with small sectors of memory, we may find performance benefits in the tens or hundreds of times compared to the traditional methods. This is all hypothetical, however, and extensive tests need to be performed before we can tell whether it is worth exploring further.
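To make the idea concrete, a hypothetical sketch of such a fixed-size writer, where a whole slice is reserved with one check up front (names like `FixedWriter` and `p1` are illustrative only):

```kotlin
// Hypothetical fixed-size writer: one bounds check for a whole slice,
// then plain array stores with none of Netty's call-chain overhead.
// (The JVM's own array bounds checks still apply; only Unsafe removes those.)
class FixedWriter(size: Int) {
    private val bytes = ByteArray(size)
    private var index = 0

    // Caller reserves the entire slice once, up front.
    fun reserve(count: Int) {
        require(index + count <= bytes.size) { "slice out of bounds" }
    }

    fun p1(value: Int) {
        bytes[index++] = value.toByte()
    }

    fun toArray(): ByteArray = bytes.copyOf(index)
}

val writer = FixedWriter(4)
writer.reserve(3)
writer.p1(1); writer.p1(2); writer.p1(3)
println(writer.toArray().contentEquals(byteArrayOf(1, 2, 3))) // true
```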
### Bit Buffers
As a compression mechanism, four packets in OSRS (`PLAYER_INFO`, `NPC_INFO`, `REBUILD_LOGIN` & `REBUILD_REGION`) currently utilize bit buffers. These are wrappers around byte buffers that allow us to write on a per-bit basis, instead of the typical per-byte. As the library will only be using the writing mechanism of bit buffers, our focus will be on writing alone. We can create bit buffers that are capable of reading too, but the performance of those is secondary.
As bit buffers require continuous reading and writing on a byte-by-byte basis, we have two primary candidates for implementing them:
#### Unsafe
While risky to use, Unsafe could yield significant benefits by eliminating any bounds checking. This does, however, mean that our application must do all the necessary checks on its own, whether implicit or explicit. If we still end up bounds-checking with every write call, the benefits of using Unsafe will be negligible.
#### Direct byte arrays
One of the downsides of Netty's heap buffers is the extensive chain of function calls and bounds checking involved. By accessing the backing byte array directly, we can eliminate those function calls and perform many of the bounds checks ourselves, in larger batches. This could also yield significant performance benefits, although fewer than the riskier Unsafe alternative.
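A hedged sketch of this candidate: bit writes against a raw byte array, with the single bounds check hoisted out of the per-byte loop (constants inlined for brevity; function and parameter names are illustrative):

```kotlin
// Write `len` bits of `value` into `bytes` starting at bit `bitIndex`,
// returning the new bit index. One up-front bounds check covers the whole write.
fun pBits(bytes: ByteArray, bitIndex: Int, len: Int, value: Int): Int {
    require(len in 1..32)
    require(bitIndex + len <= bytes.size * 8) // single check, not per-byte
    var remaining = len
    var byteIndex = bitIndex shr 3
    var bit = bitIndex and 7
    while (remaining > 0) {
        val n = minOf(8 - bit, remaining)             // bits fitting in this byte
        val shift = (8 - (bit + n)) and 7             // align to the high end
        val mask = (1 shl n) - 1
        var v = bytes[byteIndex].toInt() and 0xFF
        v = v and (mask shl shift).inv()              // clear the target bits
        v = v or (((value shr (remaining - n)) and mask) shl shift)
        bytes[byteIndex] = v.toByte()
        remaining -= n
        byteIndex++
        bit = 0
    }
    return bitIndex + len
}

val bytes = ByteArray(2)
pBits(bytes, 0, 5, 10) // 10 = 0b01010 in the top five bits -> 0b01010000
println(bytes[0] == 0x50.toByte()) // true
```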
#### Bit Buffer's memory allocations
A common implementation of bit buffers involves making a wrapper class that contains the bit reader and writer indices. While not significant by any means, it does mean we generate a little bit of garbage with each packet that uses bit buffers. As an additional problem, the indices and the backing memory itself may be placed far apart, reducing memory locality and making CPU caching less efficient.
> [!NOTE]
> In a napkin benchmark[^1] with a length-60 byte array, utilizing the byte array itself to store the writer index resulted in a ~1.8% performance increase over having the writer index be a separate int stored in a class. This can be attributed to memory locality.
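A minimal sketch of that locality trick, with slot 0 of the array doubling as the writer index so that index and data share cache lines (illustrative only; a real implementation would operate on a `ByteBuf`):

```kotlin
// The array carries its own writer index in slot 0: no wrapper object,
// no garbage, and the index sits right next to the data it guards.
fun append(buffer: ByteArray, value: Byte) {
    val index = buffer[0].toInt() and 0xFF // writer index lives in slot 0
    buffer[index] = value
    buffer[0] = (index + 1).toByte()
}

val buffer = ByteArray(60)
buffer[0] = 1 // data starts after the index slot
append(buffer, 42)
append(buffer, 43)
println(buffer[0].toInt() == 3 && buffer[1] == 42.toByte()) // true
```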
Below is an example use case of such a buffer:
```kt
@JvmSynthetic
inline fun ByteBuf.useInlined(block: InlineBitBuf.() -> Unit) {
    val buffer = InlineBitBuf(this)
    buffer.start()
    block(buffer)
    buffer.end()
}

fun example(buffer: ByteBuf) {
    buffer.useInlined {
        pBits(5, 10)
    }
}
```
By utilizing the inlined function, no additional allocations are made when calling it from Kotlin.
If utilizing it from elsewhere on the JVM, we need a SAM implementation instead. It is worth noting, however, that the out-of-the-box implementations would all be within the library, in Kotlin, and would always utilize the inlined buffer. The SAM implementation would only be needed if a developer requires a custom packet that happens to utilize bit buffers.
```kt
fun interface BitBufferConsumer {
    fun use(buffer: InlineBitBuf)
}

fun ByteBuf.use(consumer: BitBufferConsumer) {
    val buffer = InlineBitBuf(this)
    buffer.start()
    consumer.use(buffer)
    buffer.end()
}
```
The SAM variant will be usable from within Kotlin too, although an anonymous object is created for each bit buffer usage.
> [!NOTE]
> The `@JvmSynthetic` annotation would be used to avoid generating a JVM-compatible non-inlined static variant of the `useInlined` function, as the syntax for that is sub-optimal and SAM-compatible variants are far nicer to use from the JVM.
Below is an example implementation of the `InlineBitBuf` class. It uses Graham's implementation of `pBits` and `gBits`, with minor changes:
> [!IMPORTANT]
> The performance will need to be benchmarked at a later date. Optimizations should be made to ensure the bit buffer is as fast as it can possibly be, due to the incredibly high number of calls it will experience.
**InlineBitBuf Snippet**
```kt
import io.netty.buffer.ByteBuf
import kotlin.math.min

@JvmInline
value class InlineBitBuf
@PublishedApi
internal constructor(private val buffer: ByteBuf) {
    private var readerIndex: Int
        get() = buffer.getInt(0)
        set(value) {
            buffer.setInt(0, value)
        }

    private var writerIndex: Int
        get() = buffer.getInt(Int.SIZE_BYTES)
        set(value) {
            buffer.setInt(Int.SIZE_BYTES, value)
        }

    @PublishedApi
    internal fun start() {
        val writerIndex = buffer.writerIndex()
        val readerIndex = buffer.readerIndex()
        if (writerIndex == 0 && readerIndex == 0) {
            buffer.writeInt(BITS_PER_BYTE * BITS_PER_BYTE)
            buffer.writeInt(BITS_PER_BYTE * BITS_PER_BYTE)
            buffer.readerIndex(BITS_PER_BYTE)
        } else {
            check(writerIndex >= BITS_PER_BYTE) {
                "Corrupt buffer: $writerIndex"
            }
            buffer.setInt(0, writerIndex * BITS_PER_BYTE)
            buffer.setInt(Int.SIZE_BYTES, readerIndex * BITS_PER_BYTE)
        }
    }

    @PublishedApi
    internal fun end() {
        val writerIndex = this.writerIndex
        this.writerIndex = 0
        val bits = (((writerIndex + MASK_BITS_PER_BYTE) and MASK_BITS_PER_BYTE.inv()) - writerIndex)
        if (bits != 0) {
            setBits(writerIndex, bits, 0)
        }
        buffer.writerIndex((writerIndex + bits) shr LOG_BITS_PER_BYTE)
        val readerIndex = (readerIndex + MASK_BITS_PER_BYTE) and MASK_BITS_PER_BYTE.inv()
        this.readerIndex = 0
        buffer.readerIndex(readerIndex shr LOG_BITS_PER_BYTE)
    }

    fun pBits(
        len: Int,
        value: Int,
    ) {
        setBits(this.writerIndex, len, value)
        this.writerIndex += len
    }

    private fun capacity(): Int {
        return buffer.capacity() shl LOG_BITS_PER_BYTE
    }

    private fun setBits(
        index: Int,
        len: Int,
        value: Int,
    ) {
        require(len in 1..BITS_PER_INT)
        if (index < 0 || (index + len) > capacity()) {
            throw IndexOutOfBoundsException()
        }
        var remaining = len
        var byteIndex = index shr LOG_BITS_PER_BYTE
        var bitIndex = index and MASK_BITS_PER_BYTE
        while (remaining > 0) {
            val n = min(BITS_PER_BYTE - bitIndex, remaining)
            val shift = (BITS_PER_BYTE - (bitIndex + n)) and MASK_BITS_PER_BYTE
            val mask = (1 shl n) - 1
            var v = buffer.getUnsignedByte(byteIndex).toInt()
            v = v and (mask shl shift).inv()
            v = v or (((value shr (remaining - n)) and mask) shl shift)
            buffer.setByte(byteIndex, v)
            remaining -= n
            byteIndex++
            bitIndex = 0
        }
    }

    fun gBits(len: Int): Int {
        require(len in 1..BITS_PER_INT)
        val index = readerIndex
        if (index < 0 || (index + len) > capacity()) {
            throw IndexOutOfBoundsException()
        }
        var value = 0
        var remaining = len
        var byteIndex = index shr LOG_BITS_PER_BYTE
        var bitIndex = index and MASK_BITS_PER_BYTE
        while (remaining > 0) {
            val n = min(BITS_PER_BYTE - bitIndex, remaining)
            val shift = (BITS_PER_BYTE - (bitIndex + n)) and MASK_BITS_PER_BYTE
            val mask = (1 shl n) - 1
            val v = buffer.getUnsignedByte(byteIndex).toInt()
            value = value shl n
            value = value or ((v shr shift) and mask)
            remaining -= n
            byteIndex++
            bitIndex = 0
        }
        this.readerIndex += len
        return value
    }

    private companion object {
        private const val LOG_BITS_PER_BYTE = 3
        private const val BITS_PER_BYTE = 1 shl LOG_BITS_PER_BYTE
        private const val MASK_BITS_PER_BYTE = BITS_PER_BYTE - 1
        private const val BITS_PER_INT = 32
    }
}
```
[^1]: Napkin benchmarks refer to benchmarks that were done without a JMH setup, in a simple snippet of iterative code using Kotlin's `measureTime` functionality. These benchmarks may not necessarily be accurate, as there is a considerable amount of background noise and the process runs for significantly less time than typical benchmarks.