On the sender side I am seeing this error while restarting the process:
{:timestamp=>"2014-05-08T09:50:14.961000+0200", :message=>"ZeroMQ error while in send_string", :error_code=>-1, :level=>:error}
{:timestamp=>"2014-05-08T09:50:14.961000+0200", :message=>"0mq output exception", :address=>["tcp://*:2100"], :exception=>#
Also, it does not restart properly; the process has to be killed. So it seems there really is an issue with using zeromq for input and output. I am using the pushpull topology.
Here is debug information from the receiver side, in case it helps:
ZMQ Error {:subscriber=>#<ZMQ::Socket:0x96e32ba @name="PULL", @option_lookup=[nil, 1, nil, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, nil, nil, nil, nil, nil, 0,
0], @int_cache=nil, @more_parts_array=[], @receiver_klass=ZMQ::Message, @socket=#
The issue appears to be related to ZMQ_RCVTIMEO on the receiving side and ZMQ_SNDTIMEO on the sending side. The default value is -1, which means wait indefinitely, so ZMQ just keeps waiting for a message and the TERM signal is never acted on; you have to kill the process. If I set ZMQ_RCVTIMEO to, say, 5 seconds, then when no message arrives within 5 seconds it returns an error:
ZeroMQ error while in recv_string {:error_code=>-1, :level=>:error}
ZMQ returns EAGAIN when no message arrives before the timeout expires. So I wonder what the solution is in this case?
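For reference, this is roughly what the receive path looks like once the timeout is set, assuming the ffi-rzmq bindings that logstash uses (the endpoint and the 5-second value are just illustrative):

require 'ffi-rzmq'

context = ZMQ::Context.new
socket  = context.socket(ZMQ::PULL)
socket.setsockopt(ZMQ::RCVTIMEO, 5000)     # default is -1: block in recv forever
socket.connect("tcp://server_adress:2100")

message = ''
rc = socket.recv_string(message)
unless ZMQ::Util.resultcode_ok?(rc)
  # on timeout recv_string returns -1 (logged as :error_code=>-1)
  # and errno is set to EAGAIN
  puts "timed out waiting for a message" if ZMQ::Util.errno == ZMQ::EAGAIN
end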
Any comments/suggestions are appreciated.
So I wonder what the best solution is in such a scenario. Not using ZeroMQ at all? There are not many options left if you want elasticity without a single point of failure, which ZeroMQ provides with the pushpull topology.
Using ZMQ REQ/REP sockets is not going to be terribly easy here. Those sockets do not allow for retries or resends because of their state machine; the only way to recover from a dropped message is to close the socket and start fresh. If REP socket responses can sometimes go unsent, this requires handling on the REP side as well. On top of that, both server and client must do this handling in lock-step, with multiple parallel sockets if parallelism is wanted.
I would say the ZeroMQ code needs to be re-engineered for reliability, if that is desired.
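For illustration, the usual close-and-reopen recovery for a REQ socket (the zguide's "Lazy Pirate" pattern) looks roughly like this with ffi-rzmq; the endpoint, payload, and timeout are placeholders:

require 'ffi-rzmq'

REQUEST_TIMEOUT_MS = 2500
SERVER_ENDPOINT    = "tcp://localhost:5555"

# (Re)create a REQ socket from scratch; a REQ socket whose request went
# unanswered is stuck in its state machine and cannot simply retry.
def fresh_req_socket(context)
  socket = context.socket(ZMQ::REQ)
  socket.setsockopt(ZMQ::LINGER, 0)                    # do not hang on close
  socket.setsockopt(ZMQ::RCVTIMEO, REQUEST_TIMEOUT_MS)
  socket.connect(SERVER_ENDPOINT)
  socket
end

context = ZMQ::Context.new
socket  = fresh_req_socket(context)
socket.send_string("ping")

reply = ''
unless ZMQ::Util.resultcode_ok?(socket.recv_string(reply))
  # No reply in time: throw the socket away, make a new one, resend.
  socket.close
  socket = fresh_req_socket(context)
  socket.send_string("ping")
end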
Thanks for reporting & troubleshooting this. The zeromq filter has been moved to logstash-contrib, so can you open the issue over there and reference this one?
I can do that, but I am using zeromq not as a filter but as input and output. Have the input/output plugins also been moved to logstash-contrib?
Just a quick note: REQ/REP sockets in ZeroMQ are not really meant to work in scenarios where messages can be dropped. The only way to recover from any error is to close the socket entirely and open a new one. The plugin should be re-engineered for reliability.
That might be so for REQ/REP sockets, but the issue I mention here is with the pushpull topology: messages are never sent back from clients to the producers/servers, only received from them and sent down the line to elasticsearch using the elasticsearch output.
Sorry, you are probably talking about the input/output zeromq plugins - the only one I took a peek at was the zeromq filter, which uses only a REQ socket. My mistake for thinking these two plugins were the same.
Oops, brain fart. You're talking about the zeromq input/output, ignore my last comment.
So, this is really hard to diagnose with just the logs you provided. Can you create a minimal config with steps to reproduce this problem?
On the server side, which receives input from logstash-forwarder via lumberjack and outputs it using zeromq in server mode:
input {
  lumberjack {
    port => 5000
    ssl_certificate => "..."
  }
}
output { zeromq { address => ["tcp://*:2100"] topology => "pushpull" mode => "server" } }
Now, on the client side, to receive events using zeromq:
input {
  zeromq {
    address => ["tcp://server_adress:2100"]
    topology => "pushpull"
    mode => "client"
  }
}
output { stdout { codec => "json" } }
Just send one or a few events from logstash-forwarder to the server, and once you see the event printed to stdout on the client side, try to stop the client logstash process. It will fail; you have to kill it. If you instead enable the elasticsearch output, run it in the background, and try to restart the process, you will see the error I mentioned above in the report. But if you use this config on the client side to receive input:
input { zeromq { topology => "pushpull" address => ["tcp://server_adress:2100"] mode => "client" sockopt => ["ZMQ::RCVTIMEO", 5000] } }
you will see the same message printed every 5 seconds whenever logstash does not receive a message within the 5-second interval, but now you will be able to stop the process with SIGTERM rather than having to kill it with SIGKILL.
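In other words, the timeout turns the blocking recv into a periodic wakeup, so a shutdown flag set from the TERM handler actually gets checked. A rough sketch of that loop with ffi-rzmq (names and values are illustrative):

require 'ffi-rzmq'

shutting_down = false
Signal.trap("TERM") { shutting_down = true }  # now delivered between recv calls

context = ZMQ::Context.new
socket  = context.socket(ZMQ::PULL)
socket.setsockopt(ZMQ::RCVTIMEO, 5000)        # wake up every 5 s instead of blocking forever
socket.connect("tcp://server_adress:2100")

until shutting_down
  message = ''
  rc = socket.recv_string(message)
  if ZMQ::Util.resultcode_ok?(rc)
    puts message
  elsif ZMQ::Util.errno != ZMQ::EAGAIN
    warn "ZeroMQ error while in recv_string: #{ZMQ::Util.error_string}"
  end
  # EAGAIN just means the 5 s elapsed with no message; loop and re-check the flag.
end

socket.close
context.terminate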
Ok, thanks, we'll have to run some tests with this & followup.
I just hit a similar problem when I upgraded to logstash 1.4.1 from 1.3.2: I had a continuous stream of "ZeroMQ error while in recv_string" errors being spewed out. The problem turned out to be the fact that the logstash tarball contains .gitignore files. See https://logstash.jira.com/browse/LOGSTASH-2252
@timbunce https://logstash.jira.com/browse/LOGSTASH-2252 is resolved, though I am not really sure how zeromq problems relate to .gitignore
When restarting logstash while it is receiving input from zeromq, logstash gives this error:
{:timestamp=>"2014-05-01T00:38:36.503000+0200", :message=>"ZeroMQ error while in recv_string", :error_code=>-1, :level=>:error}
The error repeats for a while until logstash has started up completely; it functions fine afterwards, though.