mtopolnik opened 5 years ago
Natural, a.k.a. smart, batching is a technique in stream processing that optimizes throughput without affecting latency.
The above statement is reasonable, but it has to be proven for at least one channel type (i.e. for the array channel).
In my opinion, a receiveMany method has to handle backpressure, i.e. receive(10) or receiveTo(buffer).
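For illustration, a minimal sketch of what such a receiveMany could look like as a plain extension (the name and signature are hypothetical, not an existing kotlinx.coroutines API):

import kotlinx.coroutines.channels.ReceiveChannel

// Hypothetical sketch: suspend for the first element, then take whatever else
// is already available without suspending, up to maxElements, which bounds
// the batch size and thereby provides backpressure.
suspend fun <E : Any> ReceiveChannel<E>.receiveMany(maxElements: Int): List<E> {
    val batch = ArrayList<E>(maxElements)
    batch += receive() // suspends until at least one element arrives
    while (batch.size < maxElements) {
        batch += tryReceive().getOrNull() ?: break // non-suspending
    }
    return batch
}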
@mtopolnik Can you please share a use-case you have in mind? I do want to study those, because I'm also reluctant to add this kind of API to channels. These kinds of performance optimizations are better kept hidden, as implementation details of higher-level APIs.
Let me explicitly mention #254 here. With an abstraction for cold streams in place, data flows become declarative -- you declare your data processing pipeline and then launch it, without even having to explicitly use communication primitives like channels, etc. So, if your use-case is to generate a sequence of data items in one thread and consume them in another, then you would be able to use the following hypothetical API to do it:
dataSource() // produces data in the context of one thread
    .withContext(other)
    .forEach { doSomethingWith(it) }
Now, behind the scenes, the withContext combinator will perform the switching between threads using an extremely efficient, naturally batched single-producer single-consumer queue, without you, as its end user, ever having to think about it.
The issue is not the efficiency of inter-thread communication, but what you want to do with the items. Natural batching applies to any situation where you're performing an operation that has a significant fixed cost per invocation and can process a whole batch of items for roughly the same cost as a single item.
This fits many use cases, basically every case where you deal with a remote system. It could be a database INSERT, a REST service POST, etc. It even applies to within-process operations such as simple file writing, if the requirement is to use an open-write-close cycle.
If all I have is forEach, I can't observe a batch of items as a unit. I need an API that gives me the items batch-by-batch.
I can't build upon forEach to achieve the goal, either, because I have no natural batch demarcation. In the steady state the consumer should be working at full speed, never spending any time waiting for a batch to form. Once it's done with the current batch, it immediately accepts all the items that have accumulated since the last time it asked for more.
Ok. Let me rephrase this, again, as a feature request for #254. You want to have an API like this:
dataSource()
    .forEachBatch { doSomethingWithBatch(it) }
Yes, that would go a long way. One stumbling point here is the question of the ownership of the batch collection.
On the one hand there's the concern of robustness: surrendering the ownership to the consumer function on each invocation is the safest option.
On the other hand we have GC concerns: we'd like to reuse the same collection for all invocations of the consumer function.
It's hard to satisfy both concerns with a simple API. Possibly the user could supply their own collection as a first, optional argument to forEachBatch. This would help make the ownership explicit.
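To make the trade-off concrete, here is one hedged sketch of how a signature could expose both options (forEachBatch and its reuse parameter are hypothetical, not an existing API):

import kotlinx.coroutines.channels.ReceiveChannel

// Hypothetical sketch of the two ownership modes: by default each batch is a
// fresh list the consumer may keep (the safe option); passing `reuse` opts
// into recycling one collection across invocations (GC-friendly, caller beware).
suspend fun <E : Any> ReceiveChannel<E>.forEachBatch(
    reuse: MutableList<E>? = null,
    action: suspend (List<E>) -> Unit
) {
    while (true) {
        val batch = reuse?.also { it.clear() } ?: ArrayList()
        batch += receiveCatching().getOrNull() ?: return // channel closed
        while (true) {
            batch += tryReceive().getOrNull() ?: break // drain what's available
        }
        action(batch)
    }
}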
@mtopolnik I would rather avoid the issue of ownership in the easy-to-use public API (provide a fresh instance every time) to make it less error-prone, but design some lower-level APIs for advanced users that would allow them to write their own operator implementations, optimized in whatever way they see fit.
Hi @mtopolnik, for your use case you can consider:
suspend fun <E : Any> Channel<E>.consumeEachBlock(maxBlockSize: Int, consumer: (List<E>) -> Unit) {
    consume {
        val buffer = ArrayList<E>(maxBlockSize)
        while (coroutineContext[Job]?.isActive != false) {
            // suspend for the first element; null means the channel was closed
            buffer += receiveOrNull() ?: return
            // then take, without suspending, whatever else is already available
            while (buffer.size < maxBlockSize) {
                val element = poll() ?: break
                buffer += element
            }
            consumer(buffer)
            buffer.clear()
        }
    }
}
@fvasco This code looks very similar to the code I wrote in the issue description. I can't make out what it improves.
You have explicit checking of the isActive flag, but isn't it checked on each poll anyway?
You use consume, which ensures that the receive channel is cancelled when exiting the function. I suppose this has an effect if we exit the loop prematurely due to the coroutine being cancelled.

In order to make it efficient one indeed needs a built-in operation on channels that "pools" (drains?) multiple elements at once into a user-provided collection. That could help to reduce the amount of internal synchronization.
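As a strawman, the extension-function version of such an operation might look like the sketch below (drainTo is a hypothetical name; a true built-in could additionally perform the whole transfer under a single lock acquisition, which an extension cannot):

import kotlinx.coroutines.channels.ReceiveChannel

// Hypothetical sketch: moves up to `limit` already-available elements into a
// user-provided collection without suspending; returns how many were moved.
fun <E : Any> ReceiveChannel<E>.drainTo(target: MutableCollection<E>, limit: Int): Int {
    var moved = 0
    while (moved < limit) {
        target += tryReceive().getOrNull() ?: break
        moved++
    }
    return moved
}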
I went once again to inspect Agrona's queue, which I consider the canonical implementation of a zero-contention SPSC queue.
All operations are done at the release-acquire consistency level (this avoids costly volatile writes), and none of the metadata variables (head, tail, etc.) are updated from both sides, which would force the underlying cacheline to change ownership all the time. There's also padding between the head and tail memory locations so they don't share the same cacheline. Only the backing array is written from both sides, but at distant locations, which also reduces the cacheline churn.
The only saving that I see is that the loop in drain() avoids reloading the head variable on each iteration, but since only the consumer updates it, the penalty is quite small. It does update head on each iteration, which has the benefit of allowing the producer to proceed as soon as possible with adding more items, especially if the consumer gets suspended.
@mtopolnik Unfortunately, channels are MPMC, so adding "drain" to them would offer only so much benefit. For real scalability we'll need #254, which could use SPSC queues behind the scenes for buffers and for switching between threads.
As a cross-check, Agrona's MPMC queue also doesn't improve anything in drain(), which just calls poll() in a loop. So it's there just as a convenience.
@mtopolnik Our channel implementation could offer a bit of benefit over calling poll in a loop, but that benefit would be small compared to moving from MPMC to SPSC.
@mtopolnik
This code looks very similar to the code I wrote in the issue description.
Sorry, I forgot about it :-(
You have explicit checking of the isActive flag, but isn't it checked on each poll anyway?
See the note here: https://kotlin.github.io/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines.channels/-receive-channel/receive.html
@elizarov
In order to make it efficient one indeed needs a built-in operation on channels that "pools" (drains?) multiple elements at once into a user-provided collection
I propose to also consider a sendAll operator; in my use case I receive data from the network, parse it, and then send all the messages to a channel.
Generally, these primitives can be reconsidered when implementing ByteChannels.
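As an extension, sendAll can only be a loop over send, which is exactly why it was proposed as a primitive: a member implementation could insert a whole block of elements under one lock, which this hedged sketch cannot (the name is hypothetical):

import kotlinx.coroutines.channels.SendChannel

// Hypothetical sketch: the extension form pays the full per-element
// synchronization cost that a built-in sendAll could amortize.
suspend fun <E> SendChannel<E>.sendAll(elements: Iterable<E>) {
    for (element in elements) send(element)
}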
So basically we are postponing this feature until #254 and then we (probably) will be able to provide a consistent API for both cold and hot streams. We will keep this feature request in mind while designing it, thanks for pointing this out!
It will most probably be an extension, in order to help use-cases similar to the original one, where most of the performance win comes from cutting down the cost of the per-item pre-processing setup. While we are designing and implementing our core primitives to be as efficient as possible, it is not our goal to provide public access to primitives that are optimized for particular inter-thread communication patterns, such as specialized channels and/or specialized methods that amortize the cost of volatile memory operations.
I propose to also consider a sendAll operator; in my use case I receive data from the network, parse it, and then send all the messages to a channel.
What is the value of this operator? What problem does it solve that would justify introducing it into the library?
What problem does it solve that would justify introducing it into the library?
It reduces synchronization costs (adding a whole block of elements to an ArrayChannel at once).
Do you have a real-world problem where performance-critical code is dominated by the lock contention in ArrayChannel?
None for sendMany or receiveMany.
This is what I have come up with now using flows:
@ExperimentalCoroutinesApi
fun <T> Flow<T>.bufferedChunks(maxChunkSize: Int): Flow<List<T>> = channelFlow<List<T>> {
    require(maxChunkSize >= 1) {
        "Max chunk size should be greater than 0 but was $maxChunkSize"
    }
    val buffer = ArrayList<T>(maxChunkSize)
    collect {
        buffer += it
        if (buffer.size < maxChunkSize) {
            // below the size limit: hand the batch over only if the consumer
            // is ready for it right now, otherwise keep accumulating
            val offered = offer(buffer.toList())
            if (offered) {
                buffer.clear()
            }
        } else {
            // the batch is full: suspend until the consumer takes it
            send(buffer.toList())
            buffer.clear()
        }
    }
    if (buffer.size > 0) send(buffer.toList()) // emit the remainder
}.buffer(1)
So it is meant to be used as flowOf(...).bufferedChunks(10).map { ... } or similar. I thought that keeps it more composable than taking in a consumer function directly. I'm not sure it is the most efficient way to implement this, but it seems to work quite well in my experiments.
So basically we are postponing this feature until #254 and then we (probably) will be able to provide a consistent API for both cold and hot streams. We will keep this feature request in mind while designing it, thanks for pointing this out!
Can we "unpostpone" this feature since #254 is closed?
Upd.: There is also a use-case resembling Flow.conflate, where you need to perform an action only on the most recent item, but all the previous stale items should be taken into account as part of some computation and hence cannot just be dropped. Upd. 2: The only implementation of such batching I know of from the reactive world is @akarnokd's coalesce.
I think we can manage to implement it as part of #1302. That is, a single chunked operator that supports duration and size-limit parameters should work like "natural buffering" when the duration is set to zero and the size limit is set to a very big value.
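Assuming such an operator, natural batching would presumably reduce to a call like the following (the chunked overload and its parameter names are hypothetical, as is the Message type):

import kotlin.time.Duration
import kotlinx.coroutines.flow.Flow

// Hypothetical usage only: zero duration plus an effectively unlimited size
// limit means "emit whatever has accumulated as soon as the consumer is ready".
fun naturalBatches(messages: Flow<Message>): Flow<List<Message>> =
    messages.chunked(sizeLimit = Int.MAX_VALUE, duration = Duration.ZERO)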
@elizarov Does it mean that #1558 is being rejected because we want to have a single chunked operator for both size and duration? What about the windowed variant?
@circusmagnus We have not yet nailed down the design of the API for the flow windowed/chunked operations. Just gathering ideas and thoughts in all these issues.
Ok.
I think we can manage to implement it as part of #1302. That is, a single chunked operator that supports duration and size-limit parameters should work like "natural buffering" when the duration is set to zero and the size limit is set to a very big value.
@elizarov Here you mentioned the chunked operator, and I quote:
Unlike buffer, this operator will be fully sequential.
Unless I misunderstand something, natural batching can not be sequential and implies concurrency just like buffer. Moreover, your solution from here does create an additional coroutine with produceIn.
For the sake of discussion in #1302 I would like to ask: do you still plan to make chunked sequential, or will it be more like buffer and collect upstream concurrently?
Also, I would suggest you consider the natural batching use case for the buffer operator instead of (or in addition to?) chunked. Something like an optional parameter batch: Boolean = false in buffer. Buffer is already collecting upstream asynchronously and keeping the buffer. A simple switch to emit whatever is in the buffer right now as a List instead of one-by-one feels natural and in place from the API point of view.
A simple switch to emit whatever is in the buffer right now as a List instead of one-by-one feels natural and in place from the API point of view.
@pacher In principle, I agree. When I was looking for this functionality in the existing APIs, one of my first attempts was to see if I could get buffer to behave the way I wanted.
However, I don't think your proposal is doable: it's not possible for buffer to have a different return type based on whether batch is set to true or false.
From my point of view, a non-sequential chunked would make sense. Nevertheless, we might learn from @pacher's comment and mine that adding a pointer to buffer's docs might help users find their way. Something like:
If you want an operator that emits all currently buffered elements instead of each element one-by-one, use chunked(parallel = true)
Good point about the return type, @jGleitz, I don't know how I missed that. chunked will have a List in its return type, of course.
However, the question about chunked being fully sequential and how it can fit with batching still remains.
@pacher
Unless I misunderstand something, natural batching can not be sequential and implies concurrency just like buffer. Moreover, your solution from here does create an additional coroutine with produceIn.
You are right. I hadn't really thought it through initially. Natural batching does require concurrency.
Also, I would suggest you consider the natural batching use case for the buffer operator instead of (or in addition to?) chunked. Something like an optional parameter batch: Boolean = false in buffer. Buffer is already collecting upstream asynchronously and keeping the buffer. A simple switch to emit whatever is in the buffer right now as a List instead of one-by-one feels natural and in place from the API point of view.
The name of buffer would work, but its signature (returning Flow<T>) would not. We'll need a separate operator that returns Flow<List<T>>, so it is still closer to the chunked/windowed operators. The trick is to actually design it in a way that avoids operator explosion.
I use this:
fun <T> Flow<T>.chunked(batchSize: Int): Flow<List<T>> = flow {
    val accumulator = ArrayList<T>()
    var counter = 0
    this@chunked.collect {
        accumulator.add(it)
        if (++counter == batchSize) {
            emit(accumulator.toList()) // emit a copy: the accumulator is cleared next
            accumulator.clear()
            counter = 0
        }
    }
    // emit the remainder if there's any
    if (accumulator.size != 0) {
        emit(accumulator)
    }
}
It does the job for me.
@schikin As far as I can tell, your snippet does simple size-based chunking. The issue is about natural batching, which is very different.
@pacher Got it, it's just that I think people googling for simple batching with Kotlin Flows (like this: https://projectreactor.io/docs/core/release/api/reactor/core/publisher/Flux.html#buffer-int-) end up in this thread.
Any progress on this? A simple receiveMany(maxCount) would help my case a lot.
I've used the following for "natural" buffering; it works well for my needs:
fun <A> Flow<A>.chunked(): Flow<List<A>> = channelFlow {
    var batch = mutableListOf<A>()
    collectLatest {
        batch.add(it)
        // if a new element arrives while send() is suspended, collectLatest
        // cancels the send and the element simply joins the still-growing batch
        send(batch)
        batch = mutableListOf()
    }
}.buffer(Channel.RENDEZVOUS)
^ Smart one. I do have a more low-level solution. It might be a little more efficient, without coroutine cancellation and such:
Helper function to drain a channel of all elements, without suspending:
private tailrec fun <T> ReceiveChannel<T>.drain(acc: MutableList<T> = mutableListOf(), maxElements: Int): List<T> =
    if (acc.size == maxElements) acc
    else {
        val nextValue = tryReceive().getOrElse { error: Throwable? -> error?.let { throw it } ?: return acc }
        acc.add(nextValue)
        drain(acc, maxElements)
    }
^ It returns a chunk after all currently present elements are drained from the channel, or when we reach the maximum size of a chunk.
Another helper to wait for the first element in the channel and then drain it:
private suspend fun <T> ReceiveChannel<T>.awaitFirstAndDrain(maxElements: Int): List<T> = try {
    val first = receive()
    drain(mutableListOf(first), maxElements)
} catch (e: ClosedReceiveChannelException) {
    emptyList()
}
and finally a flow operator:
fun <T> Flow<T>.naturalBatching(maxSize: Int): Flow<List<T>> = flow {
    coroutineScope {
        val upstream = buffer(maxSize).produceIn(this) // concurrently write into the chunking channel
        while (!upstream.isClosedForReceive) {
            val chunk = upstream.awaitFirstAndDrain(maxSize) // concurrently drain the channel into chunks
            if (chunk.isNotEmpty()) emit(chunk)
        }
    }
}
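A quick usage sketch of the operator above (the fast producer and slow collector are made up for the demonstration; later chunks naturally contain more than one element):

import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.collect
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    flow { repeat(20) { emit(it) } } // emits much faster than the collector consumes
        .naturalBatching(maxSize = 8)
        .collect { chunk ->
            delay(50) // simulate a slow, fixed-cost batch operation
            println("batch of ${chunk.size}: $chunk")
        }
}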
Natural, a.k.a. smart, batching is a technique in stream processing that optimizes throughput without affecting latency. Taking a concurrent queue as an example, the consumer has the ability to atomically drain all the items observed at some instant and then process them as a batch. Ideally, the queue should be bounded, giving an upper limit to the batch size and providing backpressure to the sender at the same time.
It's called "natural" batching because there's no imposed batch size: when the traffic is low, it will process each item as soon as it arrives. In that case you don't need any throughput optimizations by batching items together. When the traffic gets higher, the consumer will automatically start processing larger batches, amortizing the fixed latency of a single operation like a database
INSERT
.I wrote this sample code that achieves the basic goal:
We can test it with this:
consumeBatched() polls the queue one item at a time and therefore must additionally impose a batch limit. It would be more optimal if written against a concurrent queue such as the Agrona project's OneToOneConcurrentArrayQueue, which supports the drain operation (see the sketch below).
Could support for natural batching be considered as a feature to add?
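The sample code referenced above isn't reproduced in this thread. For flavor, here is a sketch (mine, not the original) of what a drain-based consumer over Agrona's queue can look like:

import org.agrona.concurrent.OneToOneConcurrentArrayQueue

// Sketch only: drain() hands the consumer every element observed in the queue
// in one call, so no extra batch limit beyond the queue's bounded capacity is
// needed. Busy-spins when empty for brevity; real code would use an idle strategy.
fun consumeBatched(queue: OneToOneConcurrentArrayQueue<String>, handleBatch: (List<String>) -> Unit) {
    val batch = ArrayList<String>(queue.capacity())
    while (true) {
        queue.drain { batch.add(it) }
        if (batch.isNotEmpty()) {
            handleBatch(batch)
            batch.clear()
        }
    }
}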
Comment by @qwwdfsad, taken from Stack Overflow:
It depends on the desired API surface.
A drain member is unlikely to fit channel semantics: it constrains the implementation, it should somehow expose a drain limit, and it gives the channel a more "collection-like" API. E.g. how should drain behave with an unlimited channel? Is it possible to implement drain in an efficient manner (with a pre-sized buffer, but avoiding OOMs and unlimited collections) once and use it with any channel implementation?
What could be improved is additional hints from the channel, such as the expected capacity and the count of enqueued elements. They can have relaxed semantics with a default implementation and act as hints to a drain extension with some reasonable configurable upper bounds.
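To make that concrete, here is a sketch of how a drain extension might consume such a hint (the countHint parameter and the defaults are illustrative, not a proposed final API):

import kotlinx.coroutines.channels.ReceiveChannel

// Illustrative only: `countHint` stands in for a relaxed "enqueued elements"
// hint a channel could expose; the upper bound keeps the pre-sized buffer from
// growing without limit on an unlimited channel.
suspend fun <E : Any> ReceiveChannel<E>.drainHinted(
    countHint: Int,
    upperBound: Int = 1024
): List<E> {
    val batch = ArrayList<E>(countHint.coerceIn(1, upperBound))
    batch += receive() // suspend for the first element
    while (batch.size < upperBound) {
        batch += tryReceive().getOrNull() ?: break
    }
    return batch
}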