@jaapterwoerds We're currently studying these edge cases. Nevertheless, after the configured number of retries, the system stops, right? As you mention, a 404 Not Found error indicates that the session is no longer available, so we could stop retrying in that case.
The system stops, that's correct. I would suggest stopping the retries after getting the 404 and logging a more specific message.
That makes sense. OTOH, we need to check what the behaviour of the underlying Akka cluster would be in that case, especially in the face of a quick restart of the node.
Indeed, retrying the renewal on a 404 does not make much sense.
Could we perhaps fail gracefully by registering a new session, though?
@jaapterwoerds are you sure the TTL really did expire? @juanjovazquez mentioned before that something like this can happen when the consul node associated with the session goes down, which sounds plausible (though I haven't been able to reproduce it in isolation, yet)
This one seemed to recover successfully recently:
March 18th 2016, 22:15:52.782 DEBUG 31318 Successfully added self, going to RefreshScheduled
March 18th 2016, 22:15:52.782 DEBUG 31318 Transitioning to RefreshScheduled
March 18th 2016, 22:15:52.761 INFO 31318 Created session e8152e05-d0b5-b1bf-96d1-8ba463f41eaf with ttl 1 minute
March 18th 2016, 22:15:52.740 WARN 31318 unhandled event MemberUp(Member(address = akka.tcp://XXX@YYY:31826, status = Up)) in state AddingSelf
March 18th 2016, 22:15:52.739 DEBUG 31318 Transitioning out of Joining
March 18th 2016, 22:15:52.739 DEBUG 31318 Transitioning to AddingSelf
March 18th 2016, 22:15:52.738 DEBUG 31318 Received nodes Vector(akka.tcp://XXX@ZZZ:31796, akka.tcp://XXX@YYY:31826), going to Joining
March 18th 2016, 22:15:52.738 DEBUG 31318 Transitioning to Joining
March 18th 2016, 22:15:52.738 INFO 31318 Cluster Node [akka.tcp://XXX@WWW:31318] - Trying to join seed nodes [akka.tcp://XXX@ZZZ:31796, akka.tcp://XXX@YYY:31826] when already part of a cluster, ignoring
March 18th 2016, 22:15:52.731 DEBUG 31318 Waited for 3000 milliseconds, going to GettingNodes
March 18th 2016, 22:15:52.731 DEBUG 31318 Transitioning to GettingNodes
March 18th 2016, 22:15:49.718 WARN 31318 Failure in Refreshing, going to Retrying: de.heikoseeberger.constructr.coordination.Coordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/5926655c-cd31-bf5b-7e21-d5d58b6e7940
March 18th 2016, 22:15:49.718 DEBUG 31318 Transitioning from Refreshing to Retrying
March 18th 2016, 22:15:49.717 WARN 31318 Failed to refresh session 5926655c-cd31-bf5b-7e21-d5d58b6e7940: status code 404 Not Found, entity Session id '5926655c-cd31-bf5b-7e21-d5d58b6e7940' not found
March 18th 2016, 22:15:49.710 DEBUG 31318 Waited for 30000 milliseconds, going to Refreshing
March 18th 2016, 22:15:49.710 DEBUG 31318 Transitioning to Refreshing
This was with constructr version 0.9.1-Wehkamp-0.6, which is 0.9.1 with some extra logging: https://github.com/hseeberger/constructr/compare/v0.9.1...wehkamp:v0.9.1-Wehkamp-0.6
@raboof I checked the logs for the service I am talking about. In two cases the refreshing failed with a 404 while the last successful refresh was 30 seconds earlier. The service was running with constructr version 0.11.1 and default settings (ttlFactor 2.0):
March 17th 2016, 16:40:49.164 Terminating the system, because constructr-machine has terminated!
March 17th 2016, 16:40:49.164 Number of retries exhausted in Refreshing!
March 17th 2016, 16:40:49.163 Failure in Refreshing, going to RetryScheduled/Refreshing: de.heikoseeberger.constructr.coordination.Coordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/da36395f-b7a8-198d-07ba-920236d1c4a5
March 17th 2016, 16:40:49.156 Waited for 3000 milliseconds, going to Refreshing
March 17th 2016, 16:40:49.156 Transitioning to Refreshing
March 17th 2016, 16:40:46.144 Failure in Refreshing, going to RetryScheduled/Refreshing: de.heikoseeberger.constructr.coordination.Coordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/da36395f-b7a8-198d-07ba-920236d1c4a5
March 17th 2016, 16:40:46.144 Transitioning from Refreshing to RetryScheduled
March 17th 2016, 16:40:46.136 Waited for 3000 milliseconds, going to Refreshing
March 17th 2016, 16:40:46.136 Transitioning to Refreshing
March 17th 2016, 16:40:43.123 Transitioning from Refreshing to RetryScheduled
March 17th 2016, 16:40:43.123 Failure in Refreshing, going to RetryScheduled/Refreshing: de.heikoseeberger.constructr.coordination.Coordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/da36395f-b7a8-198d-07ba-920236d1c4a5
March 17th 2016, 16:40:43.115 Waited for 30000 milliseconds, going to Refreshing
March 17th 2016, 16:40:43.115 Transitioning to Refreshing
March 17th 2016, 16:40:13.104 Transitioning to RefreshScheduled
March 17th 2016, 16:40:13.103 Successfully refreshed, going to RefreshScheduled
I also see some consul logging around that time, but I am not sure if I can relate that with this issue:
March 17th 2016, 16:40:53.000 [ERR] http: Request PUT /v1/session/create, error: rpc error: rpc error: Check 'serfHealth' is in critical state
In any case, it seems we could improve the resiliency of constructr if we create a new session when the session refresh fails with a 404 and there are still retries left.
What you propose effectively means going back to the AddingSelf stage in terms of the cluster machine, since the 404 response indicates that the key is no longer available in the KV registry. When the session expires, the key disappears. But if the process goes back to AddingSelf, how can we be sure that the node is still part of the Akka cluster? This leads us to dangerous and problematic coordination issues. ;)
What I wonder now is why you are getting the 404 response in the first place. The session TTL should be higher than the interval between retries. In that case, is the key being removed from the registry?
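For reference, and assuming the session TTL is derived as refresh-interval × ttl-factor (which the values reported in this thread suggest): with the default refresh interval of 30 s and ttlFactor 2.0 the TTL is 60 s, and the retries in the log above are only 3 s apart, so even the last retry happens roughly 36 s after the last successful refresh, well inside the 60 s window. Unless the session was invalidated for another reason (e.g. a failing serfHealth check), a 404 should not occur here.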
We are still investigating why exactly we get the 404 response from Consul. There is no definitive conclusion yet, I'm afraid. I suspect it has to do with the serfHealth check failures that are logged by Consul.
To come back to the original issue, I would suggest making a change that distinguishes between permanent failures (e.g. the 404 response) and transient failures (which can be retried). This distinction needs to be made in ConstructR itself (I will create an issue for this). The Consul coordination implementation can then signal a permanent failure when it encounters a 404 on session renewal. In this proposal, both permanent failures and transient failures which have exhausted the number of retries will result in the termination of the actor system.
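A minimal sketch of what that distinction could look like (the names and types here are hypothetical illustrations, not the actual ConstructR API):

    // Hypothetical sketch only; ConstructR's real types differ.
    sealed trait CoordinationFailure
    // e.g. timeouts or 5xx responses: worth retrying
    final case class TransientFailure(cause: Throwable) extends CoordinationFailure
    // e.g. 404 on session renewal: the session is gone, retrying cannot help
    final case class PermanentFailure(cause: Throwable) extends CoordinationFailure

    // How the machine could react to a refresh failure.
    def onRefreshFailure(failure: CoordinationFailure, retriesLeft: Int): String =
      failure match {
        case PermanentFailure(_)                    => "Terminate"      // fail fast, log a specific message
        case TransientFailure(_) if retriesLeft > 0 => "RetryScheduled" // schedule another refresh attempt
        case TransientFailure(_)                    => "Terminate"      // retries exhausted
      }

The point is only that the coordination implementation classifies the failure and the machine no longer retries the permanent ones.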
The debate on whether to change the behaviour on permanent failures, for instance to go back to state AddingSelf, is then a separate matter.
@jaapterwoerds @raboof After checking this in the ConstructR project, may I close this issue? Thanks.
We have also experienced the exact same issue. When I checked the list of active sessions in Consul, the offending session was no longer there. The TTL is set to 60 s and the refresh interval was left at the default value, 30 s. We don't know yet why the session disappeared, but as a consequence the node was shut down. Any further suggestions/ideas on what the root cause might be? We are still investigating.
03:29:01.806 [system-akka.actor.default-dispatcher-15] INFO a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
03:29:01.808 [system-akka.actor.default-dispatcher-15] INFO a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
03:29:01.808 [system-akka.actor.default-dispatcher-15] INFO a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
03:29:01.809 [system-akka.actor.default-dispatcher-15] INFO a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
09:16:31.116 [system-akka.actor.default-dispatcher-2] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.consul.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/521ecd64-63a3-a45c-7268-9a26df738cb2
09:16:34.168 [system-akka.actor.default-dispatcher-3] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.consul.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/521ecd64-63a3-a45c-7268-9a26df738cb2
09:16:37.226 [system-akka.actor.default-dispatcher-4] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.consul.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/521ecd64-63a3-a45c-7268-9a26df738cb2
09:16:37.231 [system-akka.actor.default-dispatcher-18] ERROR d.h.c.machine.ConstructrMachine - Number of retries exhausted in Refreshing!
09:16:37.234 [system-akka.actor.default-dispatcher-4] ERROR d.h.constructr.akka.Constructr - Terminating the system, because constructr-machine has terminated!
09:16:37.241 [system-akka.actor.default-dispatcher-2] INFO a.r.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
09:16:37.242 [system-akka.actor.default-dispatcher-2] INFO a.r.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
It might be related to https://github.com/hashicorp/consul/issues/1033. Our service is running in a Docker container, and by digging into the Consul agent logs I found that at one point the agent was not able to contact the service for the health check. That might have triggered Consul to mark the node as down. According to the Consul documentation https://www.consul.io/docs/internals/sessions.html, the session attached to the node is invalidated when any of its health checks go to the critical state.
@gerson24 I have got https://github.com/hseeberger/constructr/pull/120 merged in. At least now, in the event of a 404, the node will gracefully leave the cluster. It still does not fix this issue, but I think it's a little better.
Any suggestions on how to handle the 404 issue?
Since it appears that this issue persists, I'll reopen it to think more carefully about how to handle it properly. @berardino thanks for the feedback.
@berardino We've discussed this internally and reached the conclusion that ConstructR should not be responsible for fixing a Consul-related issue. IMHO ConstructR is already resilient, as it enables you to set a certain number of retries. After those retries, and having left the Akka cluster (thanks to your PR on ConstructR), the better way to go would be to let the node stop working and let the cluster scheduler, or whatever machinery you have in place, deal with the issue, for instance by restarting the node or alerting about the problem. Does that make sense?
Yeah, I also think this should be solved within the boundaries of Consul.
I made some changes to ConsulCoordination.refresh so that, in the event of a NotFound response, instead of continuing to try to refresh the session, if the current node is still an Up member of the cluster, it just invokes addSelf. This approach seems to be working a little better, but it still fails sometimes due to:
Failure in AddingSelf, going to RetryScheduled/AddingSelf: java.util.NoSuchElementException: Future.filter predicate is not satisfied
So basically ConsulCoordination is not even able to create a new session. After adding more logging I found that the createSession method is failing with:
500 Internal Server Error HttpEntity.Strict(text/plain; charset=UTF-8,Check 'serfHealth' is in critical state)
Sometimes after a couple of retries this error goes away (the Consul agent serf check reports an OK status) and ConsulCoordination is able to recover by creating a new session.
So what I think is going on here is that the Consul agent randomly reports a false-positive serf check. The Consul server believes that the service is down and therefore invalidates the sessions associated with the node.
So, if you agree, I can open a PR with the changes I have implemented that make ConsulCoordination a little more resilient to faulty Consul behaviour. As for the faulty serf check, I think that might be related to the service running in Docker (at least in our case). I will keep investigating to see whether that can be solved too.
@berardino, thanks for your PR and comments. However, I don't agree with your proposal of coming back to the AddingSelf stage. As I pointed out previously, if the machine comes back to AddingSelf, how can we be sure that the node is still a member of the Akka cluster? Take into account that the joining phase precedes the AddingSelf phase. IMHO, the extra level of resiliency that you're looking for would be better reached by means of the cluster scheduler.
In any case, your PR changes the basic rules about the possible transitions in ConstructR, so I think you should discuss it with @hseeberger in ConstructR first. This project is about providing an alternative to etcd for coordination and it shouldn't change the basic rules set by ConstructR itself.
@juanjovazquez well, to be precise, ConstructR does not go back to the AddingSelf state. In fact it is just the Consul refresh method that uses the addSelf method to create a new session if it is unable to refresh the existing one. If you want to decouple the two concepts, you might consider extracting the current addSelf implementation into a createIfNotExist method and having:
override def addSelf[A: NodeSerialization](self: A, ttl: FiniteDuration) =
  createIfNotExist(self, ttl)

def createIfNotExist[A: NodeSerialization](self: A, ttl: FiniteDuration) = {
  // actual addSelf implementation
}
then I can change my PR https://github.com/Tecsisa/constructr-consul/pull/27/files#diff-cfcdf993fe3f04a5f708a6d43c4ce068R160 to invoke createIfNotExist instead of addSelf. This makes it clear that we are not adding self, but just trying to be resilient and create a session if we are unable to refresh the existing one.
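For clarity, a rough sketch of that fallback (the signatures are simplified and hypothetical, not the exact constructr-consul code):

    import scala.concurrent.{ExecutionContext, Future}
    import scala.concurrent.duration.FiniteDuration

    // Hypothetical simplification: if renewing the session returns 404 Not Found,
    // create a fresh session instead of retrying a renewal that can never succeed.
    def refresh[A](self: A, ttl: FiniteDuration, sessionId: String)(
        renewSession: String => Future[Int],                 // returns the HTTP status code
        createIfNotExist: (A, FiniteDuration) => Future[Boolean]
    )(implicit ec: ExecutionContext): Future[Boolean] =
      renewSession(sessionId).flatMap {
        case 200   => Future.successful(true)       // session renewed
        case 404   => createIfNotExist(self, ttl)   // session is gone: register a new one
        case other => Future.failed(new IllegalStateException(s"Unexpected status code $other"))
      }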
We have already deployed the changes and the system has been running just fine. As you can see from the attached logs, this issue can happen very often, and I'd rather try to recover from it than let the system go down and deal with it at the cluster-management level.
We are still investigating why this keeps happening (it might be related to Docker and the serf check reporting a false critical state). We are also considering an alternative solution that uses the Consul discovery service to bootstrap the cluster instead of having ConstructR, which can potentially kill our services at any time.
But for the time being I think my PR is a good enough solution to mitigate the issue, until we find a better one.
00:08:48.229 [service-akka.actor.default-dispatcher-16] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 500 Internal Server Error for URI /v1/session/renew/c27f8702-804b-e573-4881-fdc2790d2e90
02:09:03.526 [service-akka.actor.default-dispatcher-2] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 500 Internal Server Error for URI /v1/session/renew/c27f8702-804b-e573-4881-fdc2790d2e90
10:07:34.102 [service-akka.actor.default-dispatcher-18] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 500 Internal Server Error for URI /v1/session/renew/c27f8702-804b-e573-4881-fdc2790d2e90
10:08:44.215 [service-akka.actor.default-dispatcher-4] WARN akka.actor.ActorSystemImpl - Unable to refresh, session c27f8702-804b-e573-4881-fdc2790d2e90 not found. Creating a new one
12:08:49.470 [service-akka.actor.default-dispatcher-6] WARN akka.actor.ActorSystemImpl - Unable to refresh, session 3eee3ea2-56fb-98c0-4407-3ea8abe5df8c not found. Creating a new one
12:08:49.532 [service-akka.actor.default-dispatcher-4] ERROR akka.actor.ActorSystemImpl - Unable to create a new session 500 Internal Server Error HttpEntity.Strict(text/plain; charset=UTF-8,Check 'serfHealth' is in critical state)
12:08:49.533 [service-akka.actor.default-dispatcher-4] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: java.util.NoSuchElementException: Future.filter predicate is not satisfied
12:08:59.569 [service-akka.actor.default-dispatcher-18] WARN akka.actor.ActorSystemImpl - Unable to refresh, session 3eee3ea2-56fb-98c0-4407-3ea8abe5df8c not found. Creating a new one
12:08:59.636 [service-akka.actor.default-dispatcher-2] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: java.util.NoSuchElementException: Future.filter predicate is not satisfied
12:09:09.679 [service-akka.actor.default-dispatcher-17] WARN akka.actor.ActorSystemImpl - Unable to refresh, session 3eee3ea2-56fb-98c0-4407-3ea8abe5df8c not found. Creating a new one
15:42:20.588 [service-akka.actor.default-dispatcher-4] WARN akka.actor.ActorSystemImpl - Unable to refresh, session 541cf141-a33a-a810-44e3-22d882566642 not found. Creating a new one
15:42:20.728 [service-akka.actor.default-dispatcher-6] ERROR akka.actor.ActorSystemImpl - Unable to create a new session 500 Internal Server Error HttpEntity.Strict(text/plain; charset=UTF-8,rpc error: Check 'serfHealth' is in critical state)
15:42:20.728 [service-akka.actor.default-dispatcher-6] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: java.util.NoSuchElementException: Future.filter predicate is not satisfied
15:42:30.779 [service-akka.actor.default-dispatcher-5] WARN akka.actor.ActorSystemImpl - Unable to refresh, session 541cf141-a33a-a810-44e3-22d882566642 not found. Creating a new one
15:42:30.821 [service-akka.actor.default-dispatcher-5] ERROR akka.actor.ActorSystemImpl - Unable to create a new session 500 Internal Server Error HttpEntity.Strict(text/plain; charset=UTF-8,rpc error: Check 'serfHealth' is in critical state)
15:42:30.822 [service-akka.actor.default-dispatcher-5] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: java.util.NoSuchElementException: Future.filter predicate is not satisfied
15:42:40.857 [service-akka.actor.default-dispatcher-4] WARN akka.actor.ActorSystemImpl - Unable to refresh, session 541cf141-a33a-a810-44e3-22d882566642 not found. Creating a new one
16:08:42.956 [service-akka.actor.default-dispatcher-2] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 500 Internal Server Error for URI /v1/session/renew/745450d2-b270-44f6-366c-1e1cd66d2079
We have seen some cases in which refreshing fails because the session TTL has expired. The application was running with default settings, but somehow the refresh was executed after the TTL on the session had expired (why this can happen is another story altogether). In that particular case you get a 404 when trying to renew the session. The current behaviour is that it keeps retrying the refresh anyway, even though it can never succeed because the session has expired.
Failure in Refreshing, going to RetryScheduled/Refreshing: de.heikoseeberger.constructr.coordination.Coordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/da36395f-b7a8-198d-07ba-920236d1c4a5.
I would suggest changing the behaviour to stop retrying in this particular case.