fabric8-services / fabric8-jenkins-idler

OpenShift.io service to idle resp.unidle Jenkins instances
Apache License 2.0

Jenkins idler is retaining stale data on active builds #143

Open ldimaggi opened 6 years ago

ldimaggi commented 6 years ago

Related to issue: https://github.com/openshiftio/openshift.io/issues/2418

It appears that the issue reported in #2418 is caused by stale data relating to a completed build that is marked as active. The idler sees this build as active:

{
    "Name": "ldimaggi-osiotest2",
    "ID": "61ddf7b3-d141-402e-ae6a-30ee65f1879b",
    "ActiveBuild": {
        "metadata": {
            "name": "march1test-2",
            "namespace": "ldimaggi-osiotest2",
            "annotations": {
                "openshift.io/build.number": "2",
                "openshift.io/jenkins-namespace": "ldimaggi-osiotest2-jenkins"
            },
            "Generation": 0
        },
        "status": {
            "phase": "Running",
            "startTimestamp": "2018-03-01T18:18:49Z",
            "completionTimestamp": "2018-03-01T18:18:51.416135334Z"
        },
        "spec": {
            "replicas": 0,
            "Strategy": {
                "Type": "JenkinsPipeline"
            }
        }
    },
    "DoneBuild": {
        "metadata": {
            "name": "march1test-1",
            "namespace": "ldimaggi-osiotest2",
            "annotations": {
                "openshift.io/build.number": "1",
                "openshift.io/jenkins-namespace": "ldimaggi-osiotest2-jenkins"
            },
            "Generation": 0
        },
        "status": {
            "phase": "Complete",
            "startTimestamp": "2018-03-01T17:44:17Z",
            "completionTimestamp": "2018-03-01T18:09:19Z"
        },
        "spec": {
            "replicas": 0,
            "Strategy": {
                "Type": "JenkinsPipeline"
            }
        }
    },
    "JenkinsStateList": null,
    "JenkinsLastUpdate": "2018-03-02T13:26:41Z"
}

An active build prevents the idler from running, but no builds are actually active:

oc get builds -n ldimaggi-osiotest2
No resources found.

hferentschik commented 6 years ago

@ldimaggi So here is the thing: what happened with your builds in general? There is not even a completed build. Where are the march1test-1 and march1test-2 builds? They don't even seem to show up as completed builds.

I think you reset the environment, right? Does this also reset pipeline builds? The logic in the Idler expects a specific flow that a Build goes through and transitions the internal state accordingly. I am wondering whether resetting the environment breaks these state transitions, so that the state we keep in memory gets out of sync.

@vpavlin what do you think?

@ldimaggi @aslakknutsen I am not familiar with how resetting the environment works. What does it do in terms of OpenShift actions? What happens with existing builds? What type of events (if any) would be observable?

hferentschik commented 6 years ago

I think this relates to issue #120 and #141. We should consider timestamps and we should change how we model the data.

ldimaggi commented 6 years ago

It's my understanding that the env reset removes the build configs and deploy configs - not sure what in addition to that. Wouldn't deleting the bc's and dc's remove all running builds?

How can I clean up this situation today? I cannot see anything in OSO via oc.

hferentschik commented 6 years ago

> It's my understanding that the env reset removes the build configs and deploy configs - not sure what in addition to that.

Sure

> Wouldn't deleting the bc's and dc's remove all running builds?

I think the problem is really on the Idler side. Not sure whether there is much you can do from your end right now. Restarting the Idler might help. As part of this issue, I am planning to add some sort of reset call which would allow resetting the state for a single namespace. This way, if a namespace gets into an inconsistent state (in terms of the model the Idler builds of it), there is an easy way to reset just this namespace. But all this requires changes to the Idler code first.

lordofthejars commented 6 years ago

@hferentschik I am starting to look at this issue. After talking with @chmouel, it seems that the approach to fix this is to watch for the build delete event and, when it happens, remove the data from the idler so that no stale data remains.

Do you think this is the right approach, or would you prefer a /reset endpoint that can be called externally to delete all data?

vpavlin commented 6 years ago

> @hferentschik I am starting to look at this issue. After talking with @chmouel, it seems that the approach to fix this is to watch for the build delete event and, when it happens, remove the data from the idler so that no stale data remains.

Yeah, that makes sense. Looking at the HandleBuild code, I think it would be good to look at what happens when the build is deleted and handle that case in this function.

That should prevent weird behaviour, if you are able to recognise the delete event and react appropriately.

I'd still consider a /reset endpoint for the user/namespace to clean it up - might even be useful for testing. I'd also add the logic to call the endpoint from the "Reset environment" flow to make sure the user starts with a clean slate.

Might be also useful to check what happens with Proxy - imagine there is a webhook buffered and you reset the environment - I am not sure if it will simply stay there and retry forever, or if it disappears (sorry, long time no see with the code:) ).

Hope this helps
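A minimal sketch of handling a delete in a HandleBuild-like function. It assumes the idler sees events in the standard Kubernetes watch envelope, where each event carries a `type` (`ADDED`, `MODIFIED`, `DELETED`, ...) alongside the object; whether the Idler's watcher actually receives or exposes that type is not confirmed in this thread. The `watchEvent` and `activeBuilds` types are hypothetical and are not taken from the Idler code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// watchEvent is the generic watch envelope: an event type plus the object.
// Only the fields needed for this sketch are decoded.
type watchEvent struct {
	Type   string `json:"type"` // e.g. ADDED, MODIFIED, DELETED
	Object struct {
		Metadata struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"metadata"`
		Status struct {
			Phase string `json:"phase"`
		} `json:"status"`
	} `json:"object"`
}

// activeBuilds is a hypothetical in-memory model: namespace -> active build name.
type activeBuilds map[string]string

// handle applies one raw watch event to the model.
func (a activeBuilds) handle(raw []byte) error {
	var ev watchEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return err
	}
	ns, name := ev.Object.Metadata.Namespace, ev.Object.Metadata.Name
	switch {
	case ev.Type == "DELETED":
		// Forget the build only if it is the one currently tracked as active.
		if a[ns] == name {
			delete(a, ns)
		}
	case ev.Object.Status.Phase == "Running":
		a[ns] = name
	}
	return nil
}

func main() {
	model := activeBuilds{"ldimaggi-osiotest2": "march1test-2"}
	ev := `{"type":"DELETED","object":{"metadata":{"name":"march1test-2","namespace":"ldimaggi-osiotest2"},"status":{"phase":"Running"}}}`
	if err := model.handle([]byte(ev)); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(len(model)) // prints "0"
}
```

The point of the sketch is that the delete is signalled by the event type rather than by the build's phase, so the last phase reported for a deleted build can still be "Running".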

chmouel commented 6 years ago

yeah +1 on having a /reset may be a good idea anyway for ops and "Reset Env"!

lordofthejars commented 6 years ago

Ok, I will start with /reset, which seems easy to implement. But then reset removes all data, right? I mean, I do not filter anything.

lordofthejars commented 6 years ago

Now that I have implemented the /reset endpoint, I have started to look at reacting to a build delete event and removing the data so that it does not become stale.

The problem is that I am not really sure whether this can be detected using the event. Let me explain why:

Build events are emitted for any change to the object. The build object is specified at https://docs.openshift.com/online/rest_api/apis-build.openshift.io/v1.Build.html#object-schema and has a field called status, which is where you would expect to check whether the build has been deleted. Inside it there is a field called phase, which you might expect to hold the phase of the build: running, canceled or deleted. Sadly, if you check the possible values of this field (https://github.com/openshift/origin/blob/master/pkg/build/apis/build/types.go#L403), there is no deleted phase.

So my naive question is: is it enough to react to the canceled event? If the build is running and someone deletes it, then the build is canceled. If it is deleted when done, then we have already received the complete event, so we should not need to modify anything.

WDYT?

@chmouel @vpavlin
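The proposal above can be sketched as a small state transition. This is only an illustration under stated assumptions: `namespaceState` and `apply` are hypothetical names, and the premise that deleting a running build surfaces as a `Cancelled` phase is the assumption made in this comment, not something verified here. The phase strings are the OpenShift build phases (New, Pending, Running, Complete, Failed, Error, Cancelled).

```go
package main

import "fmt"

// namespaceState is a hypothetical, reduced version of what the info
// endpoint reports for one namespace.
type namespaceState struct {
	ActiveBuild string
	DoneBuild   string
}

// apply updates the state for one build event, given the build name and phase.
func (s *namespaceState) apply(name, phase string) {
	switch phase {
	case "New", "Pending", "Running":
		s.ActiveBuild = name
	case "Complete", "Failed", "Error":
		// The build finished on its own: record it and stop tracking it as active.
		s.DoneBuild = name
		if s.ActiveBuild == name {
			s.ActiveBuild = ""
		}
	case "Cancelled":
		// Proposal from this comment: a cancelled build only clears the active
		// slot, so it can no longer block idling.
		if s.ActiveBuild == name {
			s.ActiveBuild = ""
		}
	}
}

func main() {
	s := &namespaceState{DoneBuild: "march1test-1"}
	s.apply("march1test-2", "Running")
	s.apply("march1test-2", "Cancelled")
	fmt.Printf("%q %q\n", s.ActiveBuild, s.DoneBuild) // prints: "" "march1test-1"
}
```

If a delete does not produce a `Cancelled` update, the `Running` entry would remain in place, which is the stale state described in this issue.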

ldimaggi commented 6 years ago

Followup question - is the build automatically deleted when done?

kishansagathiya commented 6 years ago

> Followup question - is the build automatically deleted when done?

Nope, which is why we are able to see all pipeline runs in OSIO and OpenShift. But that seems obvious; am I misunderstanding your question?

kishansagathiya commented 6 years ago

> so there is no deleted phase

If there is a build with phase deleted, it isn't really deleted, is it?

lordofthejars commented 6 years ago

Well, why not? It would just be the phase of a pipeline that is being deleted. But the truth is that we cannot listen for a delete event at all.

kishansagathiya commented 6 years ago

I was able to reproduce this.

kishansagathiya commented 6 years ago

So, all the build-related data is stored in userIdler, which is kept in memory. It takes some time for a recent change to be reflected in userIdler. So, immediately after resetting the environment, if you call the info API, you will get the old data that is stored in memory. Given some time, this should change to the current state.

This issue is consistent and easily reproducible. After resetting the environment I saw old build data. I ran a new build and this is what I saw after that.

[kishansagathiya@localhost fabric8-jenkins-idler]$ curl http://localhost:8080/api/idler/info/ksagathi-preview
{"error": "Could not find queried namespace"}
[kishansagathiya@localhost fabric8-jenkins-idler]$ curl http://localhost:8080/api/idler/info/ksagathi-preview
{"error": "Could not find queried namespace"}
[kishansagathiya@localhost fabric8-jenkins-idler]$ curl http://localhost:8080/api/idler/info/ksagathi-preview
{"Name":"ksagathi-preview","ID":"7219a11c-f86a-4db1-ab3e-83216ff53009","ActiveBuild":{"metadata":{"annotations":{},"Generation":0},"status":{"phase":"New","startTimestamp":{"Time":"0001-01-01T00:00:00Z"},"completionTimestamp":{"Time":"0001-01-01T00:00:00Z"}},"spec":{"replicas":0,"Strategy":{"Type":""}}},"DoneBuild":{"metadata":{"name":"app-test-10-1","namespace":"ksagathi-preview","annotations":{"openshift.io/build.number":"1","openshift.io/jenkins-namespace":"ksagathi-preview-jenkins"},"Generation":0},"status":{"phase":"Complete","startTimestamp":{"Time":"2018-09-25T10:38:25Z"},"completionTimestamp":{"Time":"2018-09-25T13:35:27Z"}},"spec":{"replicas":0,"Strategy":{"Type":"JenkinsPipeline"}}},"JenkinsLastUpdate":"0001-01-01T00:00:00Z","IdleStatus":{"Timestamp":"0001-01-01T00:00:00Z","Success":false,"Reason":""}}

kishansagathiya commented 6 years ago

blocked on https://github.com/openshiftio/openshift.io/issues/4356

kishansagathiya commented 6 years ago

Still blocked as prod-preview is down

kishansagathiya commented 6 years ago

Not blocked anymore

kishansagathiya commented 6 years ago

upstream issue filed for this https://github.com/openshift/origin/issues/21112