doujiang24 / lua-resty-kafka

Lua Kafka client driver for OpenResty, based on the cosocket API
BSD 3-Clause "New" or "Revised" License

In async mode, buffer overflow occurs frequently #47

Open IvyTang opened 7 years ago

IvyTang commented 7 years ago

We use lua-resty-kafka to send our logs to Kafka. The QPS is 6K+ and the size per request is 0.6K. However, we see many "buffer overflow" errors in the error log, and I traced the error to this function in ringbuffer.lua:

function _M.add(self, topic, key, message)
    local num = self.num
    local size = self.size

    if num >= size then
        return nil, "buffer overflow"
    end

    local index = (self.start + num) % size
    local queue = self.queue

    queue[index] = topic
    queue[index + 1] = key
    queue[index + 2] = message

    self.num = num + 3

    return true
end
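A minimal, self-contained sketch of the same ring-buffer arithmetic (illustrative only, not the library's actual module): each queued message occupies three slots (topic, key, message), so with `max_buffering = 50000` the internal `self.size` is 150000, and `"buffer overflow"` is returned once 50000 messages are pending and unsent.

```lua
-- Sketch of the ringbuffer logic quoted above: each message takes
-- 3 slots, so size = max_buffering * 3 and overflow fires once
-- max_buffering messages are pending. Not the library source.
local function new_ringbuffer(max_buffering)
    return { queue = {}, size = max_buffering * 3, start = 1, num = 0 }
end

local function add(self, topic, key, message)
    if self.num >= self.size then
        return nil, "buffer overflow"
    end
    local index = (self.start + self.num) % self.size
    self.queue[index] = topic
    self.queue[index + 1] = key
    self.queue[index + 2] = message
    self.num = self.num + 3
    return true
end

local rb = new_ringbuffer(2)            -- tiny buffer: 2 messages max
assert(add(rb, "t", "k", "m1"))
assert(add(rb, "t", "k", "m2"))
local ok, err = add(rb, "t", "k", "m3") -- third message overflows
print(ok, err)                          -- nil  buffer overflow
```

So the error means the async send loop is not draining the buffer as fast as requests fill it.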

What config should I set? And what does this error mean?

logbird commented 7 years ago

I have the same problem! opt.max_buffering is set to the default value (50000), but the library prints 'buffer overflow' once QPS exceeds 50. I debugged the function _M.add: self.size is 150000 (50000 messages × 3 slots), which proves the configuration is applied correctly. @doujiang24

doujiang24 commented 7 years ago

@IvyTang @logbird This usually means the network between the producer and the Kafka server is not fast enough. Are the producer and the Kafka server in the same datacenter?
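When the network drains the buffer slower than requests fill it, the usual knobs are the producer config options. A configuration sketch (option names as documented in the library README; the values and broker address here are illustrative, not recommendations) that enlarges the buffer and flushes more often:

```lua
local producer = require "resty.kafka.producer"

local broker_list = {
    { host = "127.0.0.1", port = 9092 },  -- illustrative broker address
}

-- Values are illustrative; tune them against your real throughput.
local p = producer:new(broker_list, {
    producer_type = "async",
    max_buffering = 200000,  -- pending-message limit (default 50000)
    flush_time    = 500,     -- flush every 500 ms (default 1000)
    batch_num     = 200,     -- messages per batch (default 200)
})

local ok, err = p:send("test_topic", nil, "message body")
```

This only runs inside OpenResty; raising max_buffering buys headroom during short stalls but cannot fix a link that is persistently slower than the incoming rate.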

logbird commented 7 years ago

Let's just talk in Chinese... On our side, OpenResty and the Kafka service are deployed in the same datacenter, and the current production traffic should not be enough to cause this problem. While reproducing it in a development environment, I found a way to trigger it. Steps: use ab to load-test the OpenResty producer with a concurrency of 50, then shut down the Kafka service to simulate a Kafka failure. At that point Kafka is down, but the OpenResty buffer count does not grow; the library just prints errors to the error log, and the messages are lost. However, if nginx is reloaded at this moment, after the reload the buffer count only ever grows and never shrinks, until the overflow is triggered. This trigger may differ from whatever is happening in production, because the production Kafka service has been healthy the whole time.

So please help take a look. Also, if convenient, I'd be happy to discuss over QQ: 1027672948 @doujiang24

Yuanxiangz commented 4 years ago

Any solution to this issue?

TommyU commented 3 years ago

I've run into this problem and solved it; sharing it for others:

the root cause of this:

the solution to this problem:

TommyU commented 3 years ago

[memo] The worker-level lock cannot be removed: removing it causes the same message to be read by multiple coroutines in timers under high QPS (which means messages get duplicated). This is weird, since only one coroutine can hold the CPU at a time, so there should be no race condition.
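One way to make sense of this: OpenResty coroutines are not preemptive, but they do yield at every cosocket operation, so two timer handlers flushing the same buffer can interleave between "read the pending messages" and "advance the pointer". A minimal per-worker reentrancy guard (a sketch under that assumption, not the library's actual lock; `flush_buffer`, `read_batch`, and `send_batch` are hypothetical names) looks like:

```lua
-- Sketch of a per-worker reentrancy guard (illustrative, not the
-- library's actual lock). Timer handlers in one nginx worker share
-- the same Lua state, and a coroutine yields whenever it performs
-- cosocket I/O, so without a guard two timers can both read the same
-- pending messages before either advances the buffer pointer.
local flushing = false

local function flush_buffer(read_batch, send_batch)
    if flushing then
        return nil, "flush already in progress"
    end
    flushing = true
    local batch = read_batch()        -- take the pending messages
    local ok, err = send_batch(batch) -- in real code this yields on I/O
    flushing = false
    return ok, err
end

-- Demo: a second flush attempted while one is in flight is rejected,
-- which is what prevents duplicate reads of the same messages.
local nested_err
local ok = flush_buffer(
    function() return { "msg1" } end,
    function(batch)
        local _, err = flush_buffer(function() return {} end,
                                    function() return true end)
        nested_err = err
        return true
    end)
print(ok, nested_err)  -- true  flush already in progress
```

In the real library the yield happens inside the Kafka send, not in a synchronous callback as in this demo, but the guard's role is the same.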