confluentinc / librdkafka

The Apache Kafka C/C++ library

Easily reproducible message timeouts #74

Closed winbatch closed 10 years ago

winbatch commented 10 years ago

Hi - not sure if this is an rdkafka issue or an actual kafka issue. (Not sure which one I want..)

I'm finding that when I send a lot of small messages, I frequently get message timeouts. I see this both when using the API and when using the sample programs.

Example:

./rdkafka_performance -P -t TENSECONDLOGS -b masked.com:5757 -s 2 -a 2 -z gzip -p 1 -c 1000000 -X queue.buffering.max.messages=1000000

Message delivered: 471000 remain
Message delivered: 470000 remain
Message delivered: 469000 remain
Message delivered: 468000 remain
Message delivered: 467000 remain
Message delivered: 466000 remain
Message delivered: 465000 remain
Message delivered: 464000 remain
Message delivered: 463000 remain
Message delivered: 462000 remain
Message delivered: 461000 remain
Message delivered: 460000 remain

(sits and hangs here for a long time... and then:)

%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out

If I strace while it's hanging, I see:

futex(0x1374160, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x137418c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 7175, {1391707106, 253292000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x1374160, FUTEX_WAKE_PRIVATE, 1) = 0

I sometimes see this on the cluster (but definitely not always):

[2014-02-06 22:02:54,585] INFO Client session timed out, have not heard from server in 4815ms for sessionid 0x343c4f6383400a7, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:54,689] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2014-02-06 22:02:55,201] INFO Opening socket connection to server zookeeperlog1.masked.com/10.52.189.106:5700 (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,202] INFO Socket connection established to zookeeperlog1.masked.com/10.52.189.106:5700, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,207] INFO Session establishment complete on server zookeeperlog1.masked.com/10.52.189.106:5700, sessionid = 0x343c4f6383400a7, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,208] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)

Do you think I should simply increase the ZooKeeper timeout? (Note that it's running on the same host as the cluster node.) Have you seen this before?

edenhill commented 10 years ago

I think this might be caused by internal message framing overhead and I have an idea how to fix it.

But I have to ask: is this a real use case, sending lots of very small messages?

winbatch commented 10 years ago

I was doing it as part of testing, but it could theoretically happen. Remember that I am unable to batch, so if I have small content available, I'll send it immediately.


edenhill commented 10 years ago

Can't reproduce this:

$ ./rdkafka_performance -P -t multi -p 0 -b localhost -s 2 -a 2 -z gzip -c 1000000 -X queue.buffering.max.messages=1000000

...

% 1000000 messages and 2000000 bytes produced in 4809ms: 207911 msgs/s and 0.42 Mb/s, 0 messages failed, gzip compression

Some questions:

  • how many brokers are in the replication set for this topic?
  • what happens if you try -a -1, 0 and 1? (request.required.acks) NOTE: For 0 you will need to update to latest master due to issue #75: https://github.com/edenhill/librdkafka/issues/75
  • is it possible that it's the replication between brokers that's the problem? I think you might have insinuated this in another issue at some point.
  • can you reproduce this with -d all? There's going to be a lot of output, so you can mail it to me directly.

Thanks

winbatch commented 10 years ago

You may have to run it multiple times. Also, I am using brokers that are not localhost: 3 brokers, 3 replicas. I'm unable to reproduce it with all the debugging enabled, presumably because that slows down the app enough. I will try with the other ack values.


edenhill commented 10 years ago

I managed to reproduce this by adding latency to the loopback like this: sudo tc qdisc add dev lo root netem delay 350ms

The above commit fixes the problem; can you verify on your end?

winbatch commented 10 years ago

Didn't know you could do such a thing. Will try over the weekend. Thanks very much.


winbatch commented 10 years ago

Magnus - the fix didn't seem to work right. I simply got the timeouts much more quickly (like a second) than I had in the past. And the number remaining seemed to follow some weird pattern...

Message delivered: 556000 remain
Message delivered: 555000 remain
Message delivered: 554000 remain
Message delivered: 553000 remain
Message delivered: 552000 remain
Message delivered: 551000 remain
Message delivered: 550000 remain
Message delivered: 549000 remain
Message delivery failed: Local: Message timed out (548827 remain)
Message delivery failed: Local: Message timed out (548826 remain)
Message delivery failed: Local: Message timed out (548825 remain)
...
Message delivery failed: Local: Message timed out (548780 remain)
Message delivery failed: Local: Message timed out (548779 remain)
Message delivery failed: Local: Message timed out (548000 remain)
Message delivery failed: Local: Message timed out (547828 remain)
Message delivery failed: Local: Message timed out (547000 remain)
Message delivery failed: Local: Message timed out (546828 remain)
Message delivery failed: Local: Message timed out (546000 remain)
Message delivery failed: Local: Message timed out (545828 remain)
Message delivery failed: Local: Message timed out (545000 remain)
Message delivery failed: Local: Message timed out (544828 remain)

winbatch commented 10 years ago

Hi Magnus - I believe you mentioned offline that this could be caused by librdkafka using the wrong parameter for the message timeout? Just wondering if you figured it out?

edenhill commented 10 years ago

Those timeouts were fixed prior to what you tested last in this issue report (940dfdc), so I don't think you benefited from those fixes in this scenario.

winbatch commented 10 years ago

Not sure I follow - you're saying it's still broken? (Because that's what I'm seeing.)

edenhill commented 10 years ago

Still broken, yeah, I'll try to repro

edenhill commented 10 years ago

I think this might be caused by rdkafka_performance changing the default value of message.timeout.ms (300s) to 5s. This is legacy behaviour and I don't know why that value was chosen. Try setting it back to its default value, -X topic.message.timeout.ms=300000, and the problem should go away.
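
For an application producing many small messages through the Python client, the same property and failure mode look roughly like this; a minimal sketch assuming confluent-kafka-python, with placeholder broker and topic names (not taken from this issue):

# Sketch only: many small messages produced as they become available, with
# message.timeout.ms kept at librdkafka's 300000 ms default so that slow
# delivery does not immediately surface as "Local: Message timed out".
from confluent_kafka import Producer

p = Producer({
    'bootstrap.servers': 'localhost:9092',                    # placeholder
    'queue.buffering.max.messages': 1000000,
    'default.topic.config': {'message.timeout.ms': 300000},
})

failed = 0

def on_delivery(err, msg):
    """Called from poll()/flush() for every message, success or failure."""
    global failed
    if err is not None:                                        # e.g. Local: Message timed out
        failed += 1

for i in range(1000000):
    while True:
        try:
            # 2-byte payloads, sent as soon as they are available
            p.produce('test-topic', value=b'%02d' % (i % 100), callback=on_delivery)
            break
        except BufferError:                                    # local queue is full, let it drain
            p.poll(0.1)
    p.poll(0)                                                  # serve delivery callbacks

p.flush()
print('failed deliveries: %d' % failed)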

edenhill commented 10 years ago

Or update to latest master and the default override will be gone.

mikesparr commented 8 years ago

@edenhill could this be the cause of what I'm seeing in the logs? Should I explicitly add topic.message.timeout.ms to my confluent-kafka-python producer class?

2016-09-16 08:56:01,425 - INFO - extractor_rets.py - Sleeping for 5 minutes (300 seconds)
2016-09-16 09:01:08,140 - INFO - extractor_rets.py - Processed 0 rows
2016-09-16 09:01:08,141 - INFO - extractor_rets.py - Sleeping for 5 minutes (300 seconds)
%4|1474038111.837|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038111.837|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038111.837|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038174.946|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038174.946|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038174.946|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038238.055|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038238.055|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038238.055|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038252.485|FAIL|app2#producer-1| queue4:9092/4: Receive failed: Connection reset by peer
%3|1474038252.485|ERROR|app2#producer-1| queue4:9092/4: Receive failed: Connection reset by peer
%4|1474038300.358|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038314.380|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038314.380|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038316.394|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
2016-09-16 09:06:13,052 - INFO - extractor_rets.py - Processed 0 rows
2016-09-16 09:06:13,053 - INFO - extractor_rets.py - Sleeping for 5 minutes (300 seconds)
%3|1474038375.488|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038375.488|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038377.501|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038436.593|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038436.593|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038438.607|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038497.699|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038497.699|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038499.713|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
edenhill commented 8 years ago

The Metadata timeouts are an indication of some problem with the broker(s) or network; they should typically not time out. Check your broker logs.

mikesparr commented 8 years ago

I tried setting that in the Producer client configuration and see this error:

2016-09-16 10:14:00,476 - ERROR - stream_producer.pyc - Error [KafkaError{code=_INVALID_ARG,val=-186,str="No such configuration property: "message.timeout.ms""}] publishing data at topic [rets.rayac.Office]:[None]-[rayac:1347]

But I see it in your docs, and I configured it the same as the other settings?

(screenshot: the property listed in the configuration docs)

Configured...

(screenshot: producer configuration)

mikesparr commented 8 years ago

I'll try request.timeout.ms instead since it's not a "local" setting from the python client. Still trying to figure out the broker issue. I'm curious if perhaps the buffer size being 1 million could be impacting the open file limits on the server? Is a file created for each buffered message, and perhaps that explains why, once they all get running, we hit that ceiling and crash the brokers? Sorry to hijack the thread, but these errors seem all too similar to what I'm experiencing.

edenhill commented 8 years ago

message.timeout.ms is a topic-level config property so you will need to set it in the "default.topic.config" sub-dict.
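
A minimal sketch of that placement with confluent-kafka-python (the broker address below is a placeholder, not taken from this thread):

from confluent_kafka import Producer

# Topic-level properties such as message.timeout.ms belong in the
# "default.topic.config" sub-dict; putting them at the top level of the
# config dict is what produced the _INVALID_ARG error shown above.
p = Producer({
    'bootstrap.servers': 'localhost:9092',                    # placeholder
    'default.topic.config': {'message.timeout.ms': 300000},   # librdkafka default
})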

Messages are written to log segments; when a log segment reaches a certain configured size it is rolled over and a new log segment is created. So yes, the message size * message count does affect the number of open log segments. Please see the broker properties log.segment.bytes and log.roll.* here: http://kafka.apache.org/documentation.html#brokerconfigs

mikesparr commented 8 years ago

Thanks for the clarification. A crazy thing happened: on a whim I updated my Ansible playbooks and reduced the 4-node cluster to 3 nodes for ZooKeeper/Kafka, and for Cassandra I removed the 4th node too. Restarted everything and the ETL is running fine; no issues whatsoever.

Is it possible the issues were caused by having an even number of servers vs. an odd number? I did read that ZooKeeper should have an odd number of nodes, but I'd be very surprised if that actually caused all these issues. Now everything is working 100% after moving from 4 nodes to 3.