criteo / kafka-sharp

A C# Kafka driver
Apache License 2.0

How to fetch N messages at a given (random) offset? #7

Closed FrancoisBeaune closed 7 years ago

FrancoisBeaune commented 7 years ago

Hello,

Until now we were using kafka-rest to consume Kafka messages from our C# project. We're now investigating whether we would benefit from switching to kafka-sharp.

kafka-rest's API to consume messages is straightforward: you ask for (at most) N messages from a given offset via an HTTP GET request, and that request blocks until either N messages have been received or the end of the topic has been reached.
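For reference, the request shape being described looks roughly like this (the path and parameter names follow Confluent's REST Proxy v1 simple-consumer endpoint; the topic, partition, host, and values are illustrative and should be checked against the kafka-rest version in use):

```
GET /topics/my-topic/partitions/0/messages?offset=42&count=100 HTTP/1.1
Host: kafka-rest:8082
Accept: application/vnd.kafka.binary.v1+json
```

The response is a JSON array of at most `count` messages starting at `offset`.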

Naturally, switching to kafka-sharp would be easier if we could replicate that workflow, at least as a first step. Although we wouldn't immediately get all the benefits of a native Kafka driver (such as blocking when there are no more messages to consume in a topic), we would at least benefit on three fronts:

To test whether fetching N messages at random is possible and efficient with kafka-sharp, we wrote the following code:

var queue = new ConcurrentQueue<RawKafkaRecord>();
var completed = new AutoResetEvent(false);

// Last offset we expect to receive.
var endOffset = beginOffset + messageCount - 1;

cluster.MessageReceived += record =>
{
    queue.Enqueue(record);
    if (record.Offset == endOffset)
    {
        completed.Set();
    }
};

completed.Reset();
cluster.Consume(topic, partition, beginOffset);
cluster.StopConsume(topic, partition, endOffset);
// Wait until the last expected message arrives, or give up after 100 ms.
completed.WaitOne(TimeSpan.FromMilliseconds(100));

RawKafkaRecord record;
while (queue.TryDequeue(out record))
{
    // Do something with the record.
}

Hopefully the idea is clear, but let's recap:

I also found out that I had to adjust (somewhat empirically) the following settings to get maximum performance:

Unfortunately the value of Configuration.FetchMessageMaxBytes depends on the number of messages I need to fetch. For 100-200 messages, 100 KB is nearly optimal.
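As a configuration sketch: only `Configuration.FetchMessageMaxBytes` is named above; the `Seeds` property and the `ClusterClient` constructor shape are assumptions about kafka-sharp's API, not confirmed settings.

```csharp
var configuration = new Configuration
{
    Seeds = "broker1:9092",            // assumed property name for the broker list
    FetchMessageMaxBytes = 100 * 1024  // ~100 KB, tuned empirically for 100-200 messages
};
var cluster = new ClusterClient(configuration, logger);
```

A larger batch size would presumably need a proportionally larger `FetchMessageMaxBytes`, since the broker caps each fetch response at that byte budget.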

My questions are:

  1. Is this a proper way to use kafka-sharp's API?
  2. Is this as efficient as it can be, given kafka-sharp's API and internals?

Thanks for the great library, and for your help.

sdanzan commented 7 years ago

Yes, it is currently the most efficient way to mimic the behaviour you're accustomed to.

FrancoisBeaune commented 7 years ago

Thanks for the confirmation.