IBM / sarama

Sarama is a Go library for Apache Kafka.
MIT License
11.54k stars 1.76k forks source link

brokerProducer is nil, panic #2184

Closed simon28082 closed 1 year ago

simon28082 commented 2 years ago
Versions
Sarama Kafka Go
1.32.0 2.8.7 1.18
Problem Description

This is not an inevitable situation, but it happens many times This is a crash triggered by brokerProducer=nil Test scenario: A message is sent every second, and the network connection is interrupted during the sending process. During the process, the producer will asynchronously close the producer, and Panic will occur in some cases in about 3-5 minutes In github.com/Shopify/sarama/async_producer.go 560 lines

if msg.retries > pp.highWatermark {
    // a new, higher, retry level; handle it and then back off
    pp.newHighWatermark(msg.retries)
    pp.backoff(msg.retries)
}

func (pp *partitionProducer) newHighWatermark(hwm int) {
    Logger.Printf("producer/leader/%s/%d state change to [retrying-%d]\n", pp.topic, pp.partition, hwm)
    pp.highWatermark = hwm

    pp.retryState[pp.highWatermark].expectChaser = true
    pp.parent.inFlight.Add(1)
//  Note: This is going to cause panic, because brokerProducer=nil
    pp.brokerProducer.input <- &ProducerMessage{Topic: pp.topic, Partition: pp.partition, flags: fin, retries: pp.highWatermark - 1}

    Logger.Printf("producer/leader/%s/%d abandoning broker %d\n", pp.topic, pp.partition, pp.leader.ID())
    pp.parent.unrefBrokerProducer(pp.leader, pp.brokerProducer)
    pp.brokerProducer = nil
}

I also took a quick look at the Dispatch method, but couldn't figure out the specific reason This problem, only in 1.32.0, I have tested other versions without this problem

niamster commented 2 years ago

Hi @crcms Can you please test this with https://github.com/Shopify/sarama/pull/2182 ? Thank you!

slaunay commented 2 years ago

Line 560 for 1.32.0 does not seem to dereference a brokerProducer: https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L560

Would you be able to post the panic trace, confirm the version and ideally provide some Sarama logs?

Also what do you mean by During the process, the producer will asynchronously close the producer?

simon28082 commented 2 years ago

@slaunay I am just disconnecting the network during the running process.

slaunay commented 2 years ago

@crcms if you can share the panic trace that would go a long way.

simon28082 commented 2 years ago

@slaunay This is a stack of panic.

goroutine 58 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc000218600, 0x1)
        vendor/github.com/Shopify/sarama/async_producer.go:618 +0x19d
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc000218600)
        vendor/github.com/Shopify/sarama/async_producer.go:562 +0x4f7
github.com/Shopify/sarama.withRecover(0x0)
        vendor/github.com/Shopify/sarama/utils.go:43 +0x3e
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
        vendor/github.com/Shopify/sarama/async_producer.go:513 +0x1fb

This is a must-have BUG, as long as you interrupt the network in the running process, this problem will occur about 3-5 minutes, and the number of retries set by Producer is the default.

simon28082 commented 2 years ago

Hi @crcms Can you please test this with #2182 ? Thank you!

Already testing, still has question

slaunay commented 2 years ago

Thanks @crcms, the stack points us to line 618 for the nil derefencing vs line 560: https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L615-L619

One scenario that would lead to the brokerProducer being nil prior to calling newHighWatermark would be if the following case <-pp.brokerProducer.abandoned matches: https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L547-L558

We need a unit test case to reproduce the issue in order to implement the proper fix. Such test case can be used as well to confirm whether it still happens with #2182.

simon28082 commented 2 years ago

@slaunay thank you very much, but the issue #2182 I used it, and it's still problematic

NewbieOrange commented 2 years ago

We are having the same issue, any updates on this?

edit: sarama v1.34.0 (go 1.19) with kafka v2.4.1

qitta commented 2 years ago

As this seems to be related, on our production system (sarama v1.36.0, go 1.19; kafka v2.8.0) a pod just crashed with:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x877330]

goroutine 1168 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0x40008e38c0, 0x2)
    /go/pkg/mod/github.com/!shopify/sarama@v1.36.0/async_producer.go:620 +0x1e0
github.com/Shopify/sarama.(*partitionProducer).dispatch(0x40008e38c0)
    /go/pkg/mod/github.com/!shopify/sarama@v1.36.0/async_producer.go:564 +0x4a4
github.com/Shopify/sarama.withRecover(0xdf5190?)
    /go/pkg/mod/github.com/!shopify/sarama@v1.36.0/utils.go:43 +0x40
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
    /go/pkg/mod/github.com/!shopify/sarama@v1.36.0/async_producer.go:515 +0x208

Are there any updates on this?

@crcms You mentioned that this is a must have BUG? So is this panic expected behavior if there is a connection issue for some minutes?

yusronhanan commented 2 years ago

As this seems to be related, on our production system (sarama v1.36.0, go 1.19; kafka v2.8.0) a pod just crashed with:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x877330]

goroutine 1168 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0x40008e38c0, 0x2)
  /go/pkg/mod/github.com/!shopify/sarama@v1.36.0/async_producer.go:620 +0x1e0
github.com/Shopify/sarama.(*partitionProducer).dispatch(0x40008e38c0)
  /go/pkg/mod/github.com/!shopify/sarama@v1.36.0/async_producer.go:564 +0x4a4
github.com/Shopify/sarama.withRecover(0xdf5190?)
  /go/pkg/mod/github.com/!shopify/sarama@v1.36.0/utils.go:43 +0x40
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
  /go/pkg/mod/github.com/!shopify/sarama@v1.36.0/async_producer.go:515 +0x208

Are there any updates on this?

@crcms You mentioned that this is a must have BUG? So is this panic expected behavior if there is a connection issue for some minutes?

I face the same issue

thothothotho commented 1 year ago

got the same issue with sarama@1.37.2 on prod when we lost our kafka:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xfca3b0]
goroutine 469 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0x40006bc5a0, 0x1)
        /home/jenkins/go/pkg/mod/github.com/!shopify/sarama@v1.37.2/async_producer.go:708 +0x1e0
github.com/Shopify/sarama.(*partitionProducer).dispatch(0x40006bc5a0)
        /home/jenkins/go/pkg/mod/github.com/!shopify/sarama@v1.37.2/async_producer.go:648 +0x4a4
github.com/Shopify/sarama.withRecover(0x0?)
        /home/jenkins/go/pkg/mod/github.com/!shopify/sarama@v1.37.2/utils.go:43 +0x40
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
        /home/jenkins/go/pkg/mod/github.com/!shopify/sarama@v1.37.2/async_producer.go:599 +0x208
Quuxplusone commented 1 year ago

We also see this segfault when the broker goes away. Sarama version 1.31.1 (2022-02-01). Issue #2322 is a duplicate of this issue #2184, right? Can we mark one or the other as a duplicate?

Dec 10 11:07:19 dut2 OUR: panic: runtime error: invalid memory address or nil pointer dereference
Dec 10 11:07:19 dut2 OUR: [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1865bfd]
Dec 10 11:07:19 dut2 OUR: goroutine 131 [running]:
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc001361380, 0x1)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:618 +0x19d
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc001361380)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:562 +0x50a
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.withRecover(0xc0004a5400?)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/utils.go:43 +0x3e
Dec 10 11:07:19 dut2 OUR: created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:513 +0x1f6
Dec 10 11:07:19 dut2 systemd: OUR.service: main process exited, code=exited, status=2/INVALIDARGUMENT
flgg2yybb commented 1 year ago

Just met the issue on sarama v1.29.1, any updates to solve it?

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1ded78f]

goroutine 799 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc0045287e0, 0x1)
    /root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:617 +0x1cf
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc0045287e0)
    /root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:561 +0x589
github.com/Shopify/sarama.withRecover(0xc003fcde00)
    /root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/utils.go:43 +0x49
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
    /root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:512 +0x1aa
edoger commented 1 year ago

I have a currently feasible mitigation solution: first set the panic handler, rebuild all producers when any panic is caught, and because the panic handler is set, the producer will not automatically close when the producer internally crashes, and the input channel will be fast is filled and blocked, a producer rebuild is also required when this is detected.

asddongmen commented 1 year ago

got the same issue: panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x12b2609]

goroutine 74332 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc0c18b9140, 0x3)
    github.com/Shopify/sarama@v1.36.0/async_producer.go:620 +0x1a9
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc0c18b9140)
    github.com/Shopify/sarama@v1.36.0/async_producer.go:564 +0x537
github.com/Shopify/sarama.withRecover(0xc1ea43e580?)
    github.com/Shopify/sarama@v1.36.0/utils.go:43 +0x3e
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
    github.com/Shopify/sarama@v1.36.0/async_producer.go:515 +0x1ea
asddongmen commented 1 year ago

seems this issue was fixed by:fix: add nil check while producer is retrying (

github-actions[bot] commented 1 year ago

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

dnwe commented 1 year ago

Closing as fixed by https://github.com/IBM/sarama/pull/2387