fabric8-services / fabric8-jenkins-proxy


Get insights about build failures #214

Closed: yzainee-zz closed this issue 6 years ago

yzainee-zz commented 6 years ago

Currently we don't have an easy way to figure out which sub-component of the build system caused a failure. Let's fix this by instrumenting the sub-components so that we can trace why/how a build failed.

This issue tracks the following points:

jaseemabid commented 6 years ago

@yzainee I know I shouldn't talk about performance without numbers, but I'm still trying this out.

Proxy: Isn't the primary workload connecting two network requests and waiting in the middle? Shouldn't Go be able to scale really well for this workload, because it's easily network bound? I'd expect the number of requests it can handle to be close enough to (network bandwidth / size of a request) with well-written concurrent code. If we come up with a number N for the number of requests we can handle, I'd like it to be reported as a fraction of this expected upper bound.
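
A minimal sketch of the pattern being described, using only Go's standard library (the upstream URL is a placeholder; the real proxy's routing is more involved):

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    // Hypothetical upstream; the real proxy routes to a per-tenant Jenkins.
    upstream, err := url.Parse("https://jenkins.example.com")
    if err != nil {
        log.Fatal(err)
    }
    // httputil.ReverseProxy streams the request to the upstream and the
    // response back, so each in-flight request mostly waits on the network.
    proxy := httputil.NewSingleHostReverseProxy(upstream)
    log.Fatal(http.ListenAndServe(":8080", proxy))
}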

Idler: Same case as above, but I think the primary bottleneck will be the time taken for the OpenShift cluster to idle/unidle. We probably won't have to do anything at all here to reach the performance upper bound.

yzainee-zz commented 6 years ago

@jaseemabid We will find out once we perform this exercise. Having said that, what we are trying to determine is whether running multiple proxies using queues, instead of a single proxy, makes a difference in performance.

chmouel commented 6 years ago

@yzainee FYI, I formatted the GitHub issue for you.

Let's make sure we chat with the SD team first if we are doing any sort of load testing on any platform (prod-preview, prod). /cc @kbsingh

jaseemabid commented 6 years ago

From a service reliability standpoint we should run a separate idler per cluster, but I doubt there will be any major performance changes. There are several reports online of Go servers handling millions of requests per minute for IO-bound workloads.

chmouel commented 6 years ago

Let's gather numbers first before we assume what needs to be done and focus on that.

We need to think about deployment complexity too; if there are not many gains and it becomes hard to deploy, then we should drop it.

kishansagathiya commented 6 years ago

This is what I see with the pipeline:

On starting the Pipeline:-

Request URL:https://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com/oapi/v1/namespaces/ksagathi/buildconfigs/mydemo2%2Finstantiate
Request Method:OPTIONS
Status Code:204 No Content
Request URL:https://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com/oapi/v1/namespaces/ksagathi/buildconfigs/mydemo2%2Finstantiate
Request Method:POST
Status Code:201 Created
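
For reference, the same instantiate call can be reproduced from code; a minimal Go sketch, where the namespace, buildconfig name, and token are placeholders taken from the capture above:

package main

import (
    "bytes"
    "fmt"
    "log"
    "net/http"
    "os"
)

func main() {
    // Placeholder namespace and buildconfig name from the capture above.
    url := "https://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com" +
        "/oapi/v1/namespaces/ksagathi/buildconfigs/mydemo2/instantiate"
    // Minimal BuildRequest body; kind, apiVersion, and name are required.
    body := []byte(`{"kind":"BuildRequest","apiVersion":"v1","metadata":{"name":"mydemo2"}}`)

    req, err := http.NewRequest("POST", url, bytes.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Authorization", "Bearer "+os.Getenv("ACCESS_TOKEN"))
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status) // expect "201 Created", as in the capture above
}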

It requires a GitHub token for the build release:

Request URL:https://auth.openshift.io/api/token?for=https://github.com
Request Method:GET
Status Code:200 OK

On Approving to roll out:-

Request URL:https://forge.api.openshift.io/api/openshift/services/jenkins/ksagathi-jenkins/job/kishansagathiya/job/mydemo2/job/master/1/input/Proceed/proceedEmpty
Request Method:OPTIONS
Status Code:200 OK
Request URL:https://forge.api.openshift.io/api/openshift/services/jenkins/ksagathi-jenkins/job/kishansagathiya/job/mydemo2/job/master/1/input/Proceed/proceedEmpty
Request Method:POST
Status Code:200 OK

We know about the pipeline stage through OpenShift APIs (by watching the Build and BuildConfig structs):

https://docs.openshift.org/latest/rest_api/apis-build.openshift.io/v1.Build.html
https://docs.openshift.org/latest/rest_api/apis-build.openshift.io/v1.BuildConfig.html

All OpenShift requests happen through the OSO proxy. Sample requests:

wss://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com/oapi/v1/namespaces/ksagathi/buildconfigs?watch=true&access_token=$ACCESS_TOKEN
Request Method:GET
Status Code:101 Switching Protocols
wss://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com/oapi/v1/namespaces/ksagathi/builds?watch=true&access_token=$ACCESS_TOKEN
Request Method:GET
Status Code:101 Switching Protocols

Some relevant links: https://github.com/fabric8-ui/fabric8-ui/blob/master/src/a-runtime-console/kubernetes/model/build.model.ts#L107 (it seems to be using annotations from the build model).
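
A minimal Go sketch of such a watch, using the gorilla/websocket package (the endpoint mirrors the captures above; ACCESS_TOKEN is a placeholder, and a real watcher would decode each event into the Build struct):

package main

import (
    "log"
    "os"

    "github.com/gorilla/websocket"
)

func main() {
    // Watch endpoint as captured above; ACCESS_TOKEN is a placeholder.
    url := "wss://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com" +
        "/oapi/v1/namespaces/ksagathi/builds?watch=true&access_token=" +
        os.Getenv("ACCESS_TOKEN")

    conn, _, err := websocket.DefaultDialer.Dial(url, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // Each message is a watch event: {"type":"ADDED|MODIFIED|...","object":<Build>}.
    for {
        _, msg, err := conn.ReadMessage()
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("event: %s", msg)
    }
}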

yzainee-zz commented 6 years ago

I am documenting my findings here : https://docs.google.com/document/d/1yDS7tkRFkHFdLK3iMCHkXvDO8TYvln8_NpeaUSRLBEA/edit

chmouel commented 6 years ago

This document is not available, FYI.

kishansagathiya commented 6 years ago

@chmouel I think the document is private. @yzainee I am not a big fan of documents, and I don't think anyone around here is. Let's just use this issue so that everyone can see it and all your notes are easily accessible.

krishnapaparaju commented 6 years ago

@kishansagathiya If we can manually trigger the API at the Jenkins proxy / simulate incoming webhook calls, we can get to simulating load.

chmouel commented 6 years ago

@krishnapaparaju What are we trying to test? The proxy, the idler, or whether the OpenShift backend can idle and unidle properly?

kishansagathiya commented 6 years ago

@chmouel I think the intention here is to understand the end-to-end flow in the name of load testing.

yzainee-zz commented 6 years ago

The status of the build is tracked via wss. Need to check how we can get info from this, which we can use to determine build pass/fail when triggering via an API call.

Running a build via webhook:

Request URL: https://jenkins.openshift.io/github-webhook/
Request Method: POST
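
A minimal Go sketch of triggering this webhook from code (the payload is a bare-bones stand-in for a GitHub push event, which in reality carries many more fields):

package main

import (
    "bytes"
    "fmt"
    "log"
    "net/http"
)

func main() {
    // Minimal stand-in for a GitHub push payload; real events carry far more fields.
    payload := []byte(`{"ref":"refs/heads/master","repository":{"full_name":"kishansagathiya/mydemo2"}}`)

    req, err := http.NewRequest("POST", "https://jenkins.openshift.io/github-webhook/", bytes.NewReader(payload))
    if err != nil {
        log.Fatal(err)
    }
    // GitHub identifies the event type with this header.
    req.Header.Set("X-GitHub-Event", "push")
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status)
}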

kishansagathiya commented 6 years ago

@krishnapaparaju I was able to trigger a build via webhook using this curl request (this one is with prod-preview).

Next is to load test this with something like wrk. Let me know if you have any suggestions.

kishansagathiya commented 6 years ago

@krishnapaparaju I used https://github.com/wg/wrk. wrk is pretty simple and was easy to start with. On prod-preview:

[ksagathi@localhost wrk]$ ./wrk -s /home/ksagathi/wrk/loadTestingGithubWHProdPreview.lua -t 12 -c 400 -d 1m https://jenkins.prod-preview.openshift.io/github-webhook/
Running 1m test @ https://jenkins.prod-preview.openshift.io/github-webhook/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.32s   384.77ms   2.00s    60.22%
    Req/Sec    14.57     15.56   120.00     89.30%
  3300 requests in 1.00m, 1.21MB read
  Socket errors: connect 0, read 0, write 0, timeout 2664
  Non-2xx or 3xx responses: 168
Requests/sec:     54.91
Transfer/sec:     20.60KB
[ksagathi@localhost wrk]$ 

kishansagathiya commented 6 years ago

@krishnapaparaju This one is on prod

[ksagathi@localhost wrk]$ ./wrk -s /home/ksagathi/wrk/loadTestingGithubWHProd.lua -t 12 -c 400 -d 10s https://jenkins.openshift.io/github-webhook/
Running 10s test @ https://jenkins.openshift.io/github-webhook/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.60s   223.95ms   2.00s    77.91%
    Req/Sec    16.60     17.05   120.00     89.61%
  799 requests in 10.10s, 288.06KB read
  Socket errors: connect 0, read 0, write 0, timeout 541
  Non-2xx or 3xx responses: 104
Requests/sec:     79.12
Transfer/sec:     28.52KB
[ksagathi@localhost wrk]$ 

yzainee-zz commented 6 years ago

Results from a preliminary test using the Locust image:

kishansagathiya commented 6 years ago

@krishnapaparaju An interesting note here is that on both prod and prod-preview, triggering builds takes much more time (and the time varies widely).

yzainee-zz commented 6 years ago

How to test using Locust:

locust -f loc_test.py --host=https://jenkins.openshift.io

After running this, open http://127.0.0.1:8089/ and enter the details.

loc_test.txt

kishansagathiya commented 6 years ago

So this is the plan:-

Tests are for just one user (one tenant). These are the test scenarios:

jaseemabid commented 6 years ago

@kishansagathiya This is not a load test, but a basic integration test. wrk, Locust, ab, etc. are the wrong tools for this. The Python scripts @sthaha wrote for end-to-end testing are much closer to this goal.

This started as a load test, then became 'using a load test as a way to learn the internals', and now it's going off on a third tangent. We need clarity and focus here.

@yzainee @krishnapaparaju, what are we really trying to achieve here?

yzainee-zz commented 6 years ago

I completed the first of these points last night, i.e. triggering a build every 'x' minutes. It is currently set to 5 minutes. Let's divide up the other tasks.

jaseemabid commented 6 years ago

@yzainee How is triggering a build every 5 minutes "Jenkins Proxy load testing" as the title suggests?

yzainee-zz commented 6 years ago

@jaseemabid I agree, it's not load testing. The earlier idea was to trigger several builds together every 'x' minutes and a different set every 'y' minutes (keeping idling time in mind). It has deviated a little from load testing towards end-to-end understanding and unearthing new problems (if any) along the way.

jaseemabid commented 6 years ago

@yzainee If your goal is to write integration tests, @sthaha has extensive experience with the subject and recently started https://github.com/fabric8-services/fabric8-build-tests for the same purpose, not just for the proxy but for the build service as a whole. Please work with him on that repository.

sthaha commented 6 years ago

@kishansagathiya, @yzainee @krishnapaparaju please correct me if I am wrong, but what I gather from the discussions is that we are interested in certain metrics, e.g.

  1. how long it takes for the idler to idle and wake Jenkins up
  2. how long it takes for Jenkins to actually start
  3. what the CPU, memory, and IO usage of Jenkins is
  4. how long builds take

To me, all these point in the direction of collecting metrics, for which there are other ways, such as exposing them to Prometheus.

WDYT?

jaseemabid commented 6 years ago

We investigated https://github.com/jenkinsci/prometheus-plugin once, and it didn't look very good to me then. See the prior discussion here: https://github.com/fabric8io/fabric8-build-team/issues/4. @lordofthejars might have more insights about metrics, since he worked on them recently.

We are mixing metrics and integration tests here in the name of load testing.

lordofthejars commented 6 years ago

Yes, @aslakknutsen also mentioned the same to me when we talked: the Prometheus Jenkins plugin didn't look so good. But we can try to spend some time on it if we all agree to create a time-boxed task for this.

Regarding metrics: currently, CPU, memory, etc. are already monitored by default in the idler and the proxy. In the case of the idler, operations like idle and unidle are also monitored.
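
As a rough illustration of what monitoring idle/unidle operations can look like with the Prometheus Go client (the metric name here is made up, not the idler's actual one):

package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric; the idler's real metric names may differ.
var idleOps = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name: "idler_operation_duration_seconds",
        Help: "Duration of idle/unidle operations.",
    },
    []string{"operation"},
)

func init() {
    prometheus.MustRegister(idleOps)
}

func unidle() {
    start := time.Now()
    // ... call the OpenShift API to scale Jenkins back up ...
    idleOps.WithLabelValues("unidle").Observe(time.Since(start).Seconds())
}

func main() {
    http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
    http.ListenAndServe(":9090", nil)
}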

That said, the truth is that the Jenkins instance is currently not Prometheus-aware, and I am not sure there is any way other than using the plugin.

Also, I don't know which part is responsible for monitoring boot-up time. Maybe this is on the proxy side, but I am not sure about that.

@chmouel any hints about that?

kishansagathiya commented 6 years ago

@krishnapaparaju @yzainee Added a simple test scenario which checks whether unidling happens within 5 minutes (see the sketch below). I will build on this to create more and better scenarios. Also, as a by-product of this task, I have created https://github.com/fabric8-services/fabric8-jenkins-idler/pull/210
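
The gist of that check, as a hedged Go sketch (the tenant Jenkins URL is a placeholder; the linked scenario is the authoritative version): poll the tenant's Jenkins until it answers, and fail if that takes longer than five minutes.

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"
)

// waitForUnidle polls url until Jenkins responds with 200 or the deadline passes.
func waitForUnidle(url string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        resp, err := http.Get(url)
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil // Jenkins is up
            }
        }
        time.Sleep(10 * time.Second)
    }
    return fmt.Errorf("jenkins did not unidle within %s", timeout)
}

func main() {
    // Placeholder tenant Jenkins URL.
    if err := waitForUnidle("https://ksagathi-jenkins.example.com", 5*time.Minute); err != nil {
        log.Fatal(err)
    }
    log.Println("unidled in time")
}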

kishansagathiya commented 6 years ago

@krishnapaparaju Added one more scenario: https://github.com/kishansagathiya/build-load-tests/blob/master/scenario_2.go. Eventually I will move the previous scenario to Go as well.

kishansagathiya commented 6 years ago

@krishnapaparaju I am adding my work to https://github.com/kishansagathiya/build-load-tests. Once done, I will move this repo to https://github.com/fabric8-services or merge it into some existing repository.

kishansagathiya commented 6 years ago

@krishnapaparaju OpenShift has all the build-related data that we are looking for. We can get it via the GET builds API:

curl -k     -H "Authorization: Bearer $ACCESS_TOKEN"     -H 'Accept: application/json'     https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/ksagathi-preview/builds | jq .items[0].metadata.annotations

Here, openshift.io/jenkins-status-json is what we are interested in. It has all the details that we are looking for: what failed, what ran, when it started, when it ended, the duration of each build phase, etc. And this info is there for every build that has ever happened on OSIO. Take a look at my Jenkins status JSON: https://github.com/kishansagathiya/build-load-tests/blob/master/jenkins-status.json
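
A minimal Go sketch of pulling that annotation programmatically, equivalent to the curl above (ACCESS_TOKEN and the namespace are placeholders; decoding the status JSON itself is left out, since its schema is shown in the linked file):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "os"
)

// buildList models just the fields we need from the builds response.
type buildList struct {
    Items []struct {
        Metadata struct {
            Name        string            `json:"name"`
            Annotations map[string]string `json:"annotations"`
        } `json:"metadata"`
    } `json:"items"`
}

func main() {
    url := "https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/ksagathi-preview/builds"
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Authorization", "Bearer "+os.Getenv("ACCESS_TOKEN"))
    req.Header.Set("Accept", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var list buildList
    if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
        log.Fatal(err)
    }
    for _, b := range list.Items {
        // openshift.io/jenkins-status-json holds the per-stage pipeline status.
        fmt.Println(b.Metadata.Name, b.Metadata.Annotations["openshift.io/jenkins-status-json"])
    }
}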

yzainee-zz commented 6 years ago

@kishansagathiya Slight correction: $build-number here doesn't correlate to the build number; it is rather the index into the array. This command returns a list of all the builds of all the workspaces. We can add a filter to the above command to get the exact build status:

curl -k -H "Authorization: Bearer $ACCESS_TOKEN" -H 'Accept: application/json' https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/yzainee-preview/builds | jq '.items[] | select(.metadata.name == "workSpaceName-buildNumber" )'

FYI @krishnapaparaju

kishansagathiya commented 6 years ago

@sthaha I think this has fallen off our priorities or isn't relevant anymore. Closing this. Feel free to reopen if you think differently.