confluentinc / librdkafka

The Apache Kafka C/C++ library
Other
276 stars 3.15k forks source link

segfault: C3IC: rdkafka: Failed to create thread: Resource temporarily unavailable (11) #2390

Closed agis closed 5 years ago

agis commented 5 years ago

Apologies for not attaching debug logs but this is the first time we've encountered this SEGFAULT, so I'm putting it here before it occurs again, in case it's a known issue.

Description

We're using confluent-kafka-go 0.11.6. The source code is available here.

We might have 20-30 consumers and 70-90 producers open concurrently. However, our /proc/sys/kernel/threads-max is 514353.

We observed the following segfault:

rafka[16863]: C3IC: rdkafka: Failed to create producer: Failed to create thread: Resource temporarily unavailable (11)
rafka[16863]: fatal error: unexpected signal during runtime execution
rafka[16863]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x6a0 pc=0x7f40ac78148e]
rafka[16863]: runtime stack:
rafka[16863]: runtime.throw(0x5ff2f4, 0x2a)
rafka[16863]:         /usr/lib/go-1.10/src/runtime/panic.go:616 +0x81
rafka[16863]: runtime.sigpanic()
rafka[16863]:         /usr/lib/go-1.10/src/runtime/signal_unix.go:372 +0x28e
rafka[16863]: goroutine 48297 [syscall]:
rafka[16863]: runtime.cgocall(0x587270, 0xc420f1f5d8, 0x5feebc)
rafka[16863]:         /usr/lib/go-1.10/src/runtime/cgocall.go:128 +0x64 fp=0xc420f1f590 sp=0xc420f1f558 pc=0x405574
rafka[16863]: github.com/confluentinc/confluent-kafka-go/kafka._Cfunc_rd_kafka_new(0x0, 0x7f3c0401e3c0, 0x7f3c040116b0, 0x100, 0x0)
rafka[16863]:         _cgo_gotypes.go:835 +0x4e fp=0xc420f1f5d8 sp=0xc420f1f590 pc=0x52b6be
rafka[16863]: github.com/confluentinc/confluent-kafka-go/kafka.NewProducer.func3(0x7f3c00000000, 0x7f3c0401e3c0, 0x7f3c040116b0, 0x100, 0x8ffda1)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/confluentinc/confluent-kafka-go/kafka/producer.go:418 +0x72 fp=0xc420f1f610 sp=0xc420f1f5d8 pc
rafka[16863]: github.com/confluentinc/confluent-kafka-go/kafka.NewProducer(0xc420245760, 0x0, 0x0, 0x0)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/confluentinc/confluent-kafka-go/kafka/producer.go:418 +0x342 fp=0xc420f1f6b0 sp=0xc420f1f610 p
rafka[16863]: main.NewProducer(0xc4213af290, 0xc4200180e0, 0x14, 0x5a8720)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/producer.go:37 +0x40 fp=0xc420f1f760 sp=0xc420f1f6b0 pc=0x581100
rafka[16863]: main.(*Client).Producer(0xc421150a80, 0xc420010810, 0x12, 0xffe4, 0xc421074300)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/client.go:139 +0x174 fp=0xc420f1f828 sp=0xc420f1f760 pc=0x57ed84
rafka[16863]: main.(*Server).Handle(0xc4201281c0, 0x61afe0, 0xc4200b63c0, 0x61bbc0, 0xc4213f2060)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/server.go:257 +0x200c fp=0xc420f1ff78 sp=0xc420f1f828 pc=0x58397c
rafka[16863]: main.(*Server).ListenAndServe.func3(0xc420014450, 0xc4201281c0, 0x61afe0, 0xc4200b63c0, 0x61bbc0, 0xc4213f2060)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/server.go:390 +0x7d fp=0xc420f1ffb0 sp=0xc420f1ff78 pc=0x5864ad
rafka[16863]: runtime.goexit()
rafka[16863]:         /usr/lib/go-1.10/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420f1ffb8 sp=0xc420f1ffb0 pc=0x458e21
rafka[16863]: created by main.(*Server).ListenAndServe
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/server.go:388 +0x5df
rafka[16863]: goroutine 1 [chan receive, 22776 minutes]:
rafka[16863]: main.run(0xc42001c140)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/main.go:194 +0x1a5
rafka[16863]: main.main.func2(0xc42001c140, 0x0, 0x0)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/main.go:170 +0x2b
rafka[16863]: reflect.Value.call(0x5b29c0, 0x601c40, 0x13, 0x5f5e8d, 0x4, 0xc4200abc48, 0x1, 0x1, 0xc4200b6080, 0x5edfc0, ...)
rafka[16863]:         /usr/lib/go-1.10/src/reflect/value.go:447 +0x969
rafka[16863]: reflect.Value.Call(0x5b29c0, 0x601c40, 0x13, 0xc4200abc48, 0x1, 0x1, 0x0, 0x0, 0x5d8ac0)
rafka[16863]:         /usr/lib/go-1.10/src/reflect/value.go:308 +0xa4
rafka[16863]: github.com/urfave/cli.HandleAction(0x5b29c0, 0x601c40, 0xc42001c140, 0x0, 0x0)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/urfave/cli/app.go:482 +0x16d
rafka[16863]: github.com/urfave/cli.(*App).Run(0xc420001b00, 0xc420010090, 0x3, 0x3, 0x0, 0x0)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/urfave/cli/app.go:240 +0x4d0
rafka[16863]: main.main()
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/main.go:174 +0x457
rafka[16863]: goroutine 5 [syscall, 22776 minutes]:
rafka[16863]: os/signal.signal_recv(0x0)
rafka[16863]:         /usr/lib/go-1.10/src/runtime/sigqueue.go:139 +0xa6
rafka[16863]: os/signal.loop()
rafka[16863]:         /usr/lib/go-1.10/src/os/signal/signal_unix.go:22 +0x22
rafka[16863]: created by os/signal.init.0
rafka[16863]:         /usr/lib/go-1.10/src/os/signal/signal_unix.go:28 +0x41
rafka[16863]: goroutine 6 [select, 22776 minutes, locked to thread]:
rafka[16863]: runtime.gopark(0x602138, 0x0, 0x5f6419, 0x6, 0x18, 0x1)
rafka[16863]:         /usr/lib/go-1.10/src/runtime/proc.go:291 +0x11a
rafka[16863]: runtime.selectgo(0xc420088750, 0xc4200d2300)
rafka[16863]:         /usr/lib/go-1.10/src/runtime/select.go:392 +0xe50
rafka[16863]: runtime.ensureSigM.func1()
rafka[16863]:         /usr/lib/go-1.10/src/runtime/signal_unix.go:549 +0x1f4
rafka[16863]: runtime.goexit()
rafka[16863]:         /usr/lib/go-1.10/src/runtime/asm_amd64.s:2361 +0x1
rafka[16863]: goroutine 7 [IO wait]:
rafka[16863]: internal/poll.runtime_pollWait(0x7f40acbcef00, 0x72, 0x0)
rafka[16863]:         /usr/lib/go-1.10/src/runtime/netpoll.go:173 +0x57
rafka[16863]: internal/poll.(*pollDesc).wait(0xc420012298, 0x72, 0xc4200b6e00, 0x0, 0x0)
rafka[16863]:         /usr/lib/go-1.10/src/internal/poll/fd_poll_runtime.go:85 +0x9b
rafka[16863]: internal/poll.(*pollDesc).waitRead(0xc420012298, 0xffffffffffffff00, 0x0, 0x0)
rafka[16863]:         /usr/lib/go-1.10/src/internal/poll/fd_poll_runtime.go:90 +0x3d
rafka[16863]: internal/poll.(*FD).Accept(0xc420012280, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
rafka[16863]:         /usr/lib/go-1.10/src/internal/poll/fd_unix.go:372 +0x1a8
rafka[16863]: net.(*netFD).accept(0xc420012280, 0xc4213f2060, 0x0, 0x0)
rafka[16863]:         /usr/lib/go-1.10/src/net/fd_unix.go:238 +0x42
rafka[16863]: net.(*TCPListener).accept(0xc42000e060, 0x408aaa, 0xc4200d23c0, 0x0)
rafka[16863]:         /usr/lib/go-1.10/src/net/tcpsock_posix.go:136 +0x2e
rafka[16863]: net.(*TCPListener).Accept(0xc42000e060, 0xc4200d23c0, 0xc420014400, 0xc4201281c0, 0x61afe0)
rafka[16863]:         /usr/lib/go-1.10/src/net/tcpsock.go:259 +0x49
rafka[16863]: main.(*Server).ListenAndServe(0xc4201281c0, 0x61afe0, 0xc4200b63c0, 0xc420014430, 0xc, 0xc420014430, 0xc)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/server.go:379 +0x45e
rafka[16863]: main.run.func1(0xc4200143b0, 0x61afe0, 0xc4200b63c0)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/main.go:187 +0x1fc
rafka[16863]: created by main.run
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/main.go:185 +0x18c
rafka[16863]: goroutine 8 [chan receive, 22776 minutes]:
rafka[16863]: main.(*ConsumerManager).Run(0xc420012180)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/consumer_manager.go:60 +0x4b
rafka[16863]: main.(*Server).ListenAndServe.func1(0xc4201281c0)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/server.go:357 +0x5b
rafka[16863]: created by main.(*Server).ListenAndServe
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/server.go:355 +0x26d
rafka[16863]: goroutine 9 [chan receive, 22776 minutes]:
rafka[16863]: main.(*Server).ListenAndServe.func2(0xc420014440, 0x61afe0, 0xc4200b63c0, 0xc4201281c0)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/server.go:369 +0x72
rafka[16863]: created by main.(*Server).ListenAndServe
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/server.go:367 +0x410
rafka[16863]: goroutine 47054 [chan receive, 27 minutes]:
rafka[16863]: main.(*Producer).consumeDeliveries(0xc42014a140)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/producer.go:93 +0x98
rafka[16863]: created by main.(*Producer).Produce
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/producer.go:59 +0xa4
rafka[16863]: goroutine 47355 [chan receive, 846 minutes]:
rafka[16863]: main.(*Consumer).Run(0xc420c04a00, 0x61afe0, 0xc420384740)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/consumer.go:72 +0xfc
rafka[16863]: main.(*ConsumerManager).GetOrCreate.func1(0xc420012180, 0xc420c04a00, 0x61afe0, 0xc420384740)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/consumer_manager.go:126 +0x6b
rafka[16863]: created by main.(*ConsumerManager).GetOrCreate
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/consumer_manager.go:124 +0x69a
rafka[16863]: goroutine 47539 [chan receive, 843 minutes]:
rafka[16863]: github.com/confluentinc/confluent-kafka-go/kafka.channelProducer(0xc4211b6400)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/confluentinc/confluent-kafka-go/kafka/producer.go:445 +0x52
rafka[16863]: created by github.com/confluentinc/confluent-kafka-go/kafka.NewProducer
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/confluentinc/confluent-kafka-go/kafka/producer.go:436 +0x598
rafka[16863]: goroutine 47400 [syscall]:
rafka[16863]: github.com/confluentinc/confluent-kafka-go/kafka._Cfunc__rk_queue_poll(0x7f3aec017960, 0x64, 0xc420a9f9fc, 0xc4214b1920, 0x0, 0x0)
rafka[16863]:         _cgo_gotypes.go:436 +0x4e
rafka[16863]: github.com/confluentinc/confluent-kafka-go/kafka.(*handle).eventPoll.func1(0x7f3aec017960, 0xc400000064, 0xc420a9f9fc, 0xc4214b1920, 0x0, 0x0)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/confluentinc/confluent-kafka-go/kafka/event.go:156 +0x122
rafka[16863]: github.com/confluentinc/confluent-kafka-go/kafka.(*handle).eventPoll(0xc4214b6248, 0x0, 0x64, 0x1, 0x0, 0x0, 0x0, 0xc4213ef700)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/confluentinc/confluent-kafka-go/kafka/event.go:156 +0x161
rafka[16863]: github.com/confluentinc/confluent-kafka-go/kafka.(*Consumer).Poll(0xc4214b6240, 0x64, 0xc420811370, 0xc4213eff28)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/confluentinc/confluent-kafka-go/kafka/consumer.go:257 +0x56
rafka[16863]: main.(*Consumer).Poll(0xc420531770, 0x64, 0x0, 0x0, 0x0)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/consumer.go:118 +0xbd
rafka[16863]: main.(*Server).Handle(0xc4201281c0, 0x61afe0, 0xc4200b63c0, 0x61bbc0, 0xc420aa0128)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/server.go:136 +0x10a9
rafka[16863]: main.(*Server).ListenAndServe.func3(0xc420014450, 0xc4201281c0, 0x61afe0, 0xc4200b63c0, 0x61bbc0, 0xc420aa0128)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/server.go:390 +0x7d
rafka[16863]: created by main.(*Server).ListenAndServe
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/skroutz/rafka/server.go:388 +0x5df
rafka[16863]: goroutine 47417 [IO wait]:
rafka[16863]: internal/poll.runtime_pollWait(0x7f40acbceaf0, 0x72, 0xc420eb7600)
rafka[16863]:         /usr/lib/go-1.10/src/runtime/netpoll.go:173 +0x57
rafka[16863]: internal/poll.(*pollDesc).wait(0xc420cbe618, 0x72, 0xffffffffffffff00, 0x61a060, 0x8fc4d8)
rafka[16863]:         /usr/lib/go-1.10/src/internal/poll/fd_poll_runtime.go:85 +0x9b
rafka[16863]: internal/poll.(*pollDesc).waitRead(0xc420cbe618, 0xc42158c000, 0x10000, 0x10000)
rafka[16863]:         /usr/lib/go-1.10/src/internal/poll/fd_poll_runtime.go:90 +0x3d
rafka[16863]: internal/poll.(*FD).Read(0xc420cbe600, 0xc42158c000, 0x10000, 0x10000, 0x0, 0x0, 0x0)
rafka[16863]:         /usr/lib/go-1.10/src/internal/poll/fd_unix.go:157 +0x17d
rafka[16863]: net.(*netFD).Read(0xc420cbe600, 0xc42158c000, 0x10000, 0x10000, 0xc42055a000, 0x4, 0x1000)
rafka[16863]:         /usr/lib/go-1.10/src/net/fd_unix.go:202 +0x4f
rafka[16863]: net.(*conn).Read(0xc42052c0b8, 0xc42158c000, 0x10000, 0x10000, 0x0, 0x0, 0x0)
rafka[16863]:         /usr/lib/go-1.10/src/net/net.go:176 +0x6a
rafka[16863]: io.ReadAtLeast(0x7f40acb8a000, 0xc42052c0b8, 0xc42158c000, 0x10000, 0x10000, 0x1, 0x1000, 0x4, 0x0)
rafka[16863]:         /usr/lib/go-1.10/src/io/io.go:309 +0x86
rafka[16863]: github.com/secmask/go-redisproto.(*Parser).readSome(0xc420eb7cf8, 0x1, 0x0, 0x0)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/secmask/go-redisproto/parser.go:83 +0x153
rafka[16863]: github.com/secmask/go-redisproto.(*Parser).ReadCommand(0xc420eb7cf8, 0x0, 0x0, 0x0)
rafka[16863]:         /build/rafka-0.4.0/obj-x86_64-linux-gnu/src/github.com/secmask/go-redisproto/parser.go:241 +0xdd
rafka[16863]: main.(*Server).Handle(0xc4201281c0, 0x61afe0, 0xc4200b63c0, 0x61bbc0, 0xc42052c0b8)

Checklist

edenhill commented 5 years ago

Is there anything interesting in dmesg? Are the consumer/producers in the same process, or multiple processes? Is there enough memory?

agis commented 5 years ago

Not anything interesting in dmesg at the time the the segfault was triggered. However, the OOM killer was triggered about 45minutes later, so this might be relevant.

The consumers/producers are in the same process, but separate goroutines.

edenhill commented 5 years ago

How many brokers do you have in the cluster?

agis commented 5 years ago

@edenhill We have 3 4 brokers.

After looking into /var/log/kern.log, turns out the OOM killer was invoked also 5 minutes earlier than the segfault, so that might be the culprit.

Maybe irrelevant but regarding the amounts of threads/processes in our service:

$ systemctl show -p TasksMax -p TasksCurrent rafka
TasksCurrent=936
TasksMax=6144
edenhill commented 5 years ago

librdkafka will create 1+N+B threads per instance, where N is the number of advertised brokers and B is the number of bootstrap servers not matching the advertised brokers. With the C3 interceptor you'll double the number of client instances and thus the number of threads.

While I don't think you're likely to hit the max thread count (maxprocs), your app might run out of memory due to each thread's stack.

agis commented 5 years ago

Thanks @edenhill. Closing this as a non-issue.