doujiang24 / lua-resty-kafka

Lua Kafka client driver for OpenResty, based on the cosocket API
BSD 3-Clause "New" or "Revised" License

In async mode, buffer overflow occurs frequently #47

Open IvyTang opened 7 years ago

IvyTang commented 7 years ago

We use lua-resty-kafka to send our logs to Kafka. The QPS is 6K+ and the size per request is 0.6K. However, we see many "buffer overflow" errors in the error log, and I traced the error to this function in ringbuffer.lua:

function _M.add(self, topic, key, message)
    local num = self.num
    local size = self.size

    if num >= size then
        return nil, "buffer overflow"
    end

    local index = (self.start + num) % size
    local queue = self.queue

    queue[index] = topic
    queue[index + 1] = key
    queue[index + 2] = message

    self.num = num + 3

    return true
end
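A minimal, self-contained sketch of the same ring-buffer arithmetic (illustrative only, not the library's actual module): each queued message occupies three slots (topic, key, message), so with `max_buffering = 50000` the internal `self.size` is 150000, and `"buffer overflow"` is returned once 50000 messages are pending and unsent.

```lua
-- Sketch of the ringbuffer logic quoted above: each message takes
-- 3 slots, so size = max_buffering * 3 and overflow fires once
-- max_buffering messages are pending. Not the library source.
local function new_ringbuffer(max_buffering)
    return { queue = {}, size = max_buffering * 3, start = 1, num = 0 }
end

local function add(self, topic, key, message)
    if self.num >= self.size then
        return nil, "buffer overflow"
    end
    local index = (self.start + self.num) % self.size
    self.queue[index] = topic
    self.queue[index + 1] = key
    self.queue[index + 2] = message
    self.num = self.num + 3
    return true
end

local rb = new_ringbuffer(2)            -- tiny buffer: 2 messages max
assert(add(rb, "t", "k", "m1"))
assert(add(rb, "t", "k", "m2"))
local ok, err = add(rb, "t", "k", "m3") -- third message overflows
print(ok, err)                          -- nil  buffer overflow
```

So the error means the async send loop is not draining the buffer as fast as requests fill it.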

What config should I set? And what does this error mean?

logbird commented 7 years ago

I have the same problem! opt.max_buffering is set to the default value (50000), but the library prints 'buffer overflow' once QPS exceeds 50. I debugged the function _M.add: self.size is 150000 (50000 messages × 3 slots), which proves the configuration is applied correctly. @doujiang24

doujiang24 commented 7 years ago

@IvyTang @logbird This usually means the network between the producer and the Kafka server is not fast enough. Are the producer and the Kafka server in the same datacenter?
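When the network drains the buffer slower than requests fill it, the usual knobs are the producer config options. A configuration sketch (option names as documented in the library README; the values and broker address here are illustrative, not recommendations) that enlarges the buffer and flushes more often:

```lua
local producer = require "resty.kafka.producer"

local broker_list = {
    { host = "127.0.0.1", port = 9092 },  -- illustrative broker address
}

-- Values are illustrative; tune them against your real throughput.
local p = producer:new(broker_list, {
    producer_type = "async",
    max_buffering = 200000,  -- pending-message limit (default 50000)
    flush_time    = 500,     -- flush every 500 ms (default 1000)
    batch_num     = 200,     -- messages per batch (default 200)
})

local ok, err = p:send("test_topic", nil, "message body")
```

This only runs inside OpenResty; raising max_buffering buys headroom during short stalls but cannot fix a link that is persistently slower than the incoming rate.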

logbird commented 7 years ago

Let's just talk in Chinese... On our side, OpenResty and the Kafka service are deployed in the same datacenter, and the current production traffic should not be enough to cause this problem. While reproducing it in a development environment, I found a way to trigger it. Steps: use ab to load-test the OpenResty producer with a concurrency of 50, then shut down the Kafka service to simulate a Kafka failure. At that point Kafka is down, but the OpenResty buffer count does not grow; the library just prints errors to the error log, and the messages are lost. However, if nginx is reloaded at this moment, after the reload the buffer count only ever grows and never shrinks, until the overflow is triggered. This trigger may differ from whatever is happening in production, because the production Kafka service has been healthy the whole time.

So please help take a look. Also, if convenient, I'd be happy to discuss over QQ: 1027672948 @doujiang24

Yuanxiangz commented 4 years ago

Any solution to this issue?

TommyU commented 3 years ago

I've run into this problem and solved it; sharing it for others:

the root cause of this:

the solution to this problem:

TommyU commented 3 years ago

[memo] The worker-level lock cannot be removed: removing it causes the same message to be read by multiple coroutines in timers under high QPS (which means messages get duplicated). This is weird, since only one coroutine can hold the CPU at a time, so there should be no race condition.
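One way to make sense of this: OpenResty coroutines are not preemptive, but they do yield at every cosocket operation, so two timer handlers flushing the same buffer can interleave between "read the pending messages" and "advance the pointer". A minimal per-worker reentrancy guard (a sketch under that assumption, not the library's actual lock; `flush_buffer`, `read_batch`, and `send_batch` are hypothetical names) looks like:

```lua
-- Sketch of a per-worker reentrancy guard (illustrative, not the
-- library's actual lock). Timer handlers in one nginx worker share
-- the same Lua state, and a coroutine yields whenever it performs
-- cosocket I/O, so without a guard two timers can both read the same
-- pending messages before either advances the buffer pointer.
local flushing = false

local function flush_buffer(read_batch, send_batch)
    if flushing then
        return nil, "flush already in progress"
    end
    flushing = true
    local batch = read_batch()        -- take the pending messages
    local ok, err = send_batch(batch) -- in real code this yields on I/O
    flushing = false
    return ok, err
end

-- Demo: a second flush attempted while one is in flight is rejected,
-- which is what prevents duplicate reads of the same messages.
local nested_err
local ok = flush_buffer(
    function() return { "msg1" } end,
    function(batch)
        local _, err = flush_buffer(function() return {} end,
                                    function() return true end)
        nested_err = err
        return true
    end)
print(ok, nested_err)  -- true  flush already in progress
```

In the real library the yield happens inside the Kafka send, not in a synchronous callback as in this demo, but the guard's role is the same.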