fabric8-services / fabric8-jenkins-proxy

Apache License 2.0

Reset the pods if Jenkins fails to wake up within a certain duration #289

Closed sthaha closed 6 years ago

sthaha commented 6 years ago

We have noticed that at times Jenkins pods fail to wake up, and the only solution seems to be killing the pod and starting it again. Let's implement this as part of jenkins-proxy's un-idle step.

piyush-garg commented 6 years ago

This will help https://github.com/openshiftio/openshift.io/issues/3517

hrishin commented 6 years ago

Let's also reduce the polling frequency in the init container? That way it would fail fast.

concaf commented 6 years ago

@sthaha so, let's say, after the init container fails to connect to content-repository after 15 attempts, it reports this to the proxy, and the proxy deletes the pod and creates a new one? Since this is a networking issue underneath, it might also be the case that the init container cannot talk to the proxy either. This whole setup was done to achieve serialized execution, i.e. content-repository comes up first, and then Jenkins comes up. Maybe this behavior can be achieved by the proxy independently, without using init containers. Ignore if I don't make sense :P

sthaha commented 6 years ago

@containscafeine proxy can keep track of the attempts to reach jenkins and decide to reset it. I think it may be better to do this on the idler side as it would need to react to build/deploy events and wake jenkins up.

jfchevrette commented 6 years ago

I've been trying to reproduce the underlying network issue for the past couple of weeks without success, even by repeatedly idling/unidling my Jenkins namespace through the jenkins-proxy API. The OpenShift networking team believes there is a problem in the way we idle/unidle in some specific situations. Some namespaces are often getting stuck, and I was unable to reproduce the issue on them using the idler API.

One theory was that jenkins-proxy may be asking OpenShift to scale up content-repository at the same time the Jenkins init container is trying to wake it up by connecting to its service, which would be causing a weird situation at the OpenShift idler/SDN layer.

If we are confident that jenkins-idler is capable of handling the unidling of content-repository, we may want to try turning off the init container completely and see how it goes. If we do that, jenkins-proxy would need to be aware of all Jenkins DC changes; if the Jenkins DC is scaled up from outside jenkins-idler/proxy (manually, or because OpenShift itself unidles it), it would then have to react by also unidling/scaling up content-repository.
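The reaction described in the last paragraph, unidling content-repository whenever the Jenkins DC is scaled up from outside idler/proxy, could be sketched as pure decision logic like this. The types and names (`ScaleEvent`, `shouldUnidleContentRepo`, the `ByIdlerProxy` flag) are hypothetical; the real watch plumbing against the OpenShift API is omitted.

```go
package main

import "fmt"

// ScaleEvent is a hypothetical, simplified view of a DC replica change
// as it might be observed from a watch on the Jenkins DC.
type ScaleEvent struct {
	Namespace    string
	DC           string
	OldReplicas  int32
	NewReplicas  int32
	ByIdlerProxy bool // true if idler/proxy itself triggered the change
}

// shouldUnidleContentRepo reports whether a scale-up of the jenkins DC,
// coming from outside idler/proxy (manual scale-up, or OpenShift's own
// unidler), should cause us to unidle content-repository as well.
func shouldUnidleContentRepo(e ScaleEvent) bool {
	return e.DC == "jenkins" &&
		e.OldReplicas == 0 && e.NewReplicas > 0 &&
		!e.ByIdlerProxy
}

func main() {
	manual := ScaleEvent{Namespace: "user-jenkins", DC: "jenkins", OldReplicas: 0, NewReplicas: 1}
	fmt.Println(shouldUnidleContentRepo(manual)) // true: external wake-up, unidle content-repository

	own := ScaleEvent{Namespace: "user-jenkins", DC: "jenkins", OldReplicas: 0, NewReplicas: 1, ByIdlerProxy: true}
	fmt.Println(shouldUnidleContentRepo(own)) // false: we triggered it ourselves, already handled
}
```

Keeping the decision separate from the watch machinery makes it easy to test the "who scaled it up?" rule without a cluster.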

kishansagathiya commented 6 years ago

I think this will have to be done on the UI side, as the UI knows how many times we have tried to unidle Jenkins and how much time has passed. @sthaha WDYT?

sthaha commented 6 years ago

@kishansagathiya no, not in UI

The way I think of it, this must be done in the idler, which must keep tabs on the Jenkins instances it tried to unidle and check whether they got unidled properly. Like we already discussed, the user-idler also unidles Jenkins, and the same problem can occur there, can't it? So the caller asks the service to unidle; it shouldn't have to keep asking the service whether it really got unidled and then tell it to do its job properly.

cc @chmouel WDYT?
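The bookkeeping proposed above could be pictured with a sketch like this, assuming a hypothetical per-namespace tracker in the idler; the names (`unidleTracker`, `RecordCheck`) and the retry limit are made up for illustration.

```go
package main

import "fmt"

// unidleTracker counts consecutive failed un-idle checks per namespace
// and signals when a pod reset should be attempted instead of another retry.
type unidleTracker struct {
	maxAttempts int
	failures    map[string]int
}

func newUnidleTracker(maxAttempts int) *unidleTracker {
	return &unidleTracker{maxAttempts: maxAttempts, failures: map[string]int{}}
}

// RecordCheck notes the result of one "did Jenkins actually wake up?" check.
// It returns true when the namespace has failed enough times in a row that
// the idler should delete the pod and let it be recreated.
func (t *unidleTracker) RecordCheck(ns string, wokeUp bool) (reset bool) {
	if wokeUp {
		delete(t.failures, ns) // success clears the counter
		return false
	}
	t.failures[ns]++
	if t.failures[ns] >= t.maxAttempts {
		delete(t.failures, ns) // counter restarts after a reset
		return true
	}
	return false
}

func main() {
	tr := newUnidleTracker(3)
	fmt.Println(tr.RecordCheck("user-jenkins", false)) // false (attempt 1)
	fmt.Println(tr.RecordCheck("user-jenkins", false)) // false (attempt 2)
	fmt.Println(tr.RecordCheck("user-jenkins", false)) // true: reset the pod
}
```

This sidesteps @chmouel's question about what "successful" means only partially: whatever check feeds `RecordCheck` (HTTP 200 from the Jenkins UI, a ready endpoint, a JNLP handshake) still has to be chosen, and as noted below, init-time connectivity alone is not a reliable signal.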

chmouel commented 6 years ago

Let me try to make sure I understand this discussion.

When you say "jenkins pods fail to wake up", you are talking about that networking problem @jfchevrette has been trying to debug.

Basically @sthaha, you are suggesting we find a workaround for this openshift-sdn bug in the idler by keeping track of how many times we have been trying to unidle, tracking whether it was successful, and resetting after a certain number of tries?

What would be the definition of successful though? Because we have seen cases where network connectivity would work at init time, but when spawning a slave, the slave would fail to communicate (over JNLP) with the master.

If we are working around the network connectivity problem here, maybe that should be done inside Jenkins instead: get rid of the init container's wait-for-content-repository dependency, and have Jenkins, when spinning up a new job, detect whether it can communicate properly and fail otherwise.

What do you think ?

kishansagathiya commented 6 years ago

@sthaha Any thoughts on ^?

hrishin commented 6 years ago

Update:

We are going to remove content-repository from the init container and see how things work. If the issue still persists, we will reset the pod from the idler. WIP for resetting the pod: https://github.com/fabric8-services/fabric8-jenkins-idler/pull/261

ppitonak commented 6 years ago

@hrishin where is the removal of content-repository tracked?

hrishin commented 6 years ago

@ppitonak Ideally it would be a separate issue, but it's tracked at https://github.com/openshiftio/openshift.io/issues/3895

kishansagathiya commented 6 years ago

@ppitonak Created an issue to track this https://github.com/openshiftio/openshift.io/issues/4083

sthaha commented 6 years ago

I am closing this, as we seem to have solved the Jenkins-not-waking-up issue by:

  1. removing the init-container hack
  2. removing dependency on content-repository
  3. tweaking the jvm params
  4. allocating more resources to Jenkins