marathon-consul does not recover from connection refused in leader retrieval

hokiegeek2 commented 7 years ago

Connection error to co-located Marathon server instance results in marathon-consul restart after 5 connection errors with the following message:

 level=warning msg="Error on http request" Location="******:8080" Protocol="http" error="Get http://******:8080/v2/leader: dial tcp ******:8080: getsockopt: connection refused" statusCode"???"

This message is followed by:

level=fatal msg="Leader poll failed. Check marathon and previous errors. Exiting" error="Failed to get a leader after 5 retries"

This loop happens three times and then, on the fourth iteration, marathon-consul gets into an endless loop of the following:

level=debug "Asking Marathon for leader" Location="******:8080"
level=debug "Sending GET request to marathon" Location="******:8080" Protocol=http Uri="/v2/leader"
level=debug "I am not leader"

When I discovered that this last round of messages was occurring, I was able to curl this URL from the host box, so it seems marathon-consul got into a state in which the leader check kept failing even after connectivity to the co-located Marathon instance had returned.

I recommend adding a plugin framework that enables different options for handling failure of the co-located Marathon instance. One possibility would be to allow passing a comma-delimited list of Marathon URLs, analogous to a list of ZooKeeper hosts.

janisz commented 7 years ago

@hokiegeek2 Thanks for reporting. The endless "I am not leader" loop occurs because the co-located instance is not the leader. You can disable this leadership check with

--marathon-leader Marathon cluster-wide node name (defaults to <hostname>:8080), the some leader specific calls will be made only if the specified node is the current Marathon-leader. Set to * to always act like a Leader.

Is this working for you?
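
For illustration, the behavior the flag description implies can be sketched as below. This is a guess at the logic based only on the help text quoted above, not the project's actual code; `isLeader` is a hypothetical name. It also shows why an exact string match matters: if the configured name differs in any way from what `/v2/leader` reports (for example, a hostname versus an IP address), the instance would never consider itself leader.

```go
package main

import "fmt"

// isLeader reports whether this instance should act as leader.
// configured is the --marathon-leader value; current is the value
// returned by Marathon's /v2/leader endpoint. (Hypothetical helper.)
func isLeader(configured, current string) bool {
	if configured == "*" {
		return true // wildcard: leader detection is disabled
	}
	return configured == current // otherwise require an exact match
}

func main() {
	fmt.Println(isLeader("*", "marathon-2:8080"))               // true
	fmt.Println(isLeader("marathon-1:8080", "marathon-1:8080")) // true
	fmt.Println(isLeader("marathon-1:8080", "10.0.0.1:8080"))   // false
}
```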

hokiegeek2 commented 7 years ago

@janisz Cool, I will try that.

hokiegeek2 commented 7 years ago

@janisz I tried setting marathon-leader=* and that immediately caused the "I'm not a leader" loop.

Here's my configuration:

--sse-enabled=true 
--web-enabled=false 
--marathon-leader=<auto lookup via localhost:8080/v2/leader upon service startup> 
--marathon-location=<auto lookup of node upon service startup>

with the default setting of sync-enabled=true

janisz commented 7 years ago

Are you using the latest version?

hokiegeek2 commented 7 years ago

Yes, I grabbed the binary from GitHub.

janisz commented 7 years ago

The configuration you posted does not use *. Is that correct? Are you sure you used this setting properly? You should see a debug log message informing you that leader detection is disabled.

janisz commented 7 years ago

@hokiegeek2 Is the issue solved?

hokiegeek2 commented 7 years ago

@janisz I did use one configuration where marathon-leader=*, but that's the one that immediately went into the "I am not a leader" loop. The configuration I posted above is the one I am using now, where marathon-consul eventually goes into the "I am not a leader" loop.

One thing I tried over the last day is to run just one marathon-consul instance (I had a cluster of four previously). With just one, I am not seeing any "I am not a leader" looping issues. I've previously seen the recommendation to run just one marathon-consul instance in the marathon-consul README. Is it a problem to run 2..n marathon-consul instances? I am doing that for HA.

--John

hokiegeek2 commented 7 years ago

@janisz My apologies, I actually had 1.3.1. I just grabbed 1.3.3 and will test today.

janisz commented 7 years ago

No problem. Please confirm whether it's working for you. Any ideas on how to improve the README are welcome.

hokiegeek2 commented 7 years ago

This version fixes the problem of failing to find the leader. However, in the process of troubleshooting this, I still see a series of marathon-consul restarts due to failure to connect to the co-located Marathon. Strangely, at the same time that marathon-consul logs that it cannot connect to the co-located Marathon, I can curl the Marathon REST API just fine. I am not saying this is a problem with marathon-consul, but it is an issue in my deployment environment.

allegro / marathon-consul

marathon-consul does not recover from connection refused in leader retrieval #242