Closed yzainee-zz closed 6 years ago
@yzainee I know I shouldn't talk about performance without numbers, but I'm still trying this out.
Proxy: Isn't the primary workload connecting two network requests and waiting for them in the middle? Shouldn't Go be able to scale really well for this workload, because it's easily network bound? I'd expect the number of requests it can handle to be close enough to (network bandwidth / size of a request) with well-written concurrent code. If we come up with a number N for the number of requests we can handle, I'd like it to be reported as a fraction of this expected upper bound.
Idler: Same case as above, but I think the primary bottleneck will be the time taken for the OpenShift cluster to idle/unidle. We probably won't have to do anything at all here to reach the perf upper bound.
@jaseemabid We will get to know about that once we perform this exercise. Having said that, what we are trying to infer is whether running multiple proxies using queues, instead of a single proxy, makes a difference in performance.
@yzainee FYI, I formatted the GitHub issue for you.
Let's make sure we chat with the SD team first if we are doing any sort of load testing on any platform (prod-preview, prod). /cc @kbsingh
From a service reliability standpoint we should run a separate idler per cluster, but I doubt there will be any major performance changes. There are several reports online of Go servers handling millions of requests per minute for IO-bound workloads.
Let's gather numbers first before we assume what needs to be done and focus on that.
We need to think about deployment complexity too; if there are not many gains and it becomes hard to deploy, then we should drop it.
This is what I see with the pipeline
On starting the Pipeline:-
Request URL:https://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com/oapi/v1/namespaces/ksagathi/buildconfigs/mydemo2%2Finstantiate
Request Method:OPTIONS
Status Code:204 No Content
Request URL:https://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com/oapi/v1/namespaces/ksagathi/buildconfigs/mydemo2%2Finstantiate
Request Method:POST
Status Code:201 Created
It requires a GitHub token for the build release:
Request URL:https://auth.openshift.io/api/token?for=https://github.com
Request Method:GET
Status Code:200 OK
On approving the rollout:-
Request URL:https://forge.api.openshift.io/api/openshift/services/jenkins/ksagathi-jenkins/job/kishansagathiya/job/mydemo2/job/master/1/input/Proceed/proceedEmpty
Request Method:OPTIONS
Status Code:200 OK
Request URL:https://forge.api.openshift.io/api/openshift/services/jenkins/ksagathi-jenkins/job/kishansagathiya/job/mydemo2/job/master/1/input/Proceed/proceedEmpty
Request Method:POST
Status Code:200 OK
We know about pipeline stage through Openshift APIs (by watching Build and BuildConfig struct)
https://docs.openshift.org/latest/rest_api/apis-build.openshift.io/v1.Build.html
https://docs.openshift.org/latest/rest_api/apis-build.openshift.io/v1.BuildConfig.html
All OpenShift requests happen through the OSO proxy. Sample requests:
wss://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com/oapi/v1/namespaces/ksagathi/buildconfigs?watch=true&access_token=$ACCESS_TOKEN
Request Method:GET
Status Code:101 Switching Protocols
wss://f8osoproxy-test-dsaas-production.09b5.dsaas.openshiftapps.com/oapi/v1/namespaces/ksagathi/builds?watch=true&access_token=$ACCESS_TOKEN
Request Method:GET
Status Code:101 Switching Protocols
Some relevant links: https://github.com/fabric8-ui/fabric8-ui/blob/master/src/a-runtime-console/kubernetes/model/build.model.ts#L107 Seems like it is using annotations from the build model.
I am documenting my findings here : https://docs.google.com/document/d/1yDS7tkRFkHFdLK3iMCHkXvDO8TYvln8_NpeaUSRLBEA/edit
This document is not available, FYI.
@chmouel I think the document is private. @yzainee I am not a big fan of documents. I don't think anyone is around here. Let's just use this issue so that everyone can see it and all your notes are easily accessible.
@kishansagathiya if we can manually trigger the API at the Jenkins proxy / simulate a webhook call, we can simulate load
@krishnapaparaju what are we trying to test? The proxy, the idler, or whether the OpenShift backend can idle and unidle properly?
@chmouel I think the intention here is to understand the end-to-end flow in the name of load testing.
The status of the build is tracked via WSS. Need to check how we can get info from this, which we can use to determine build pass/fail when triggering via an API call.
Request URL: https://jenkins.openshift.io/github-webhook/ POST
@krishnapaparaju It uses https://github.com/wg/wrk. wrk is pretty simple and was easy to start with. On prod-preview:
[ksagathi@localhost wrk]$ ./wrk -s /home/ksagathi/wrk/loadTestingGithubWHProdPreview.lua -t 12 -c 400 -d 1m https://jenkins.prod-preview.openshift.io/github-webhook/
Running 1m test @ https://jenkins.prod-preview.openshift.io/github-webhook/
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.32s 384.77ms 2.00s 60.22%
Req/Sec 14.57 15.56 120.00 89.30%
3300 requests in 1.00m, 1.21MB read
Socket errors: connect 0, read 0, write 0, timeout 2664
Non-2xx or 3xx responses: 168
Requests/sec: 54.91
Transfer/sec: 20.60KB
[ksagathi@localhost wrk]$
@krishnapaparaju This one is on prod
[ksagathi@localhost wrk]$ ./wrk -s /home/ksagathi/wrk/loadTestingGithubWHProd.lua -t 12 -c 400 -d 10s https://jenkins.openshift.io/github-webhook/
Running 10s test @ https://jenkins.openshift.io/github-webhook/
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.60s 223.95ms 2.00s 77.91%
Req/Sec 16.60 17.05 120.00 89.61%
799 requests in 10.10s, 288.06KB read
Socket errors: connect 0, read 0, write 0, timeout 541
Non-2xx or 3xx responses: 104
Requests/sec: 79.12
Transfer/sec: 28.52KB
[ksagathi@localhost wrk]$
Preliminary test results using locust:
@krishnapaparaju An interesting note here is that on both prod and prod-preview, triggering builds takes much more time (and the times vary widely).
How to test using locust: run locust -f loc_test.py --host=https://jenkins.openshift.io. After it starts, open http://127.0.0.1:8089/ and enter the details.
So this is the plan:-
Tests are for just one user (one tenant). These are the test scenarios:
@kishansagathiya This is not a load test, but a basic integration test. wrk, locust, ab, etc. are the wrong tools for this. The Python scripts @sthaha wrote for end-to-end testing are much closer to this goal.
This started as a load test, then became 'using load tests as a way to learn internals', and now it's going off on a third tangent. We need clarity and focus here.
@yzainee @krishnapaparaju, what are we really trying to achieve here?
I have completed the 1st point out of these last night, i.e. trigger a build every 'x' minutes. Currently set to 5 mins. Let's divide the other tasks.
@yzainee How is triggering a build every 5 minutes "Jenkins Proxy load testing" as the title suggests?
@jaseemabid I agree, it's not load testing. The earlier idea was to trigger several builds together every 'x' minutes and a different set every 'y' minutes (keeping idling time in mind). It has deviated a little from load testing towards end-to-end understanding and unearthing new problems (if any) along the way.
@yzainee If your goal is to write integration tests, @sthaha has extensive experience with the subject and recently started https://github.com/fabric8-services/fabric8-build-tests for the same purpose - not just for proxy, but the build service as whole. Please work with him on that repository.
@kishansagathiya, @yzainee @krishnapaparaju please correct me if I am wrong, but what I gather from the discussions is that we are interested in certain metrics, e.g.
To me, all of these point in the direction of collecting metrics, for which there are other ways, such as
WDYT?
We have investigated https://github.com/jenkinsci/prometheus-plugin once and it didn't look very good to me then. See prior discussion here https://github.com/fabric8io/fabric8-build-team/issues/4. @lordofthejars might have more insights about metrics since he worked on it recently.
We are mixing metrics and integration tests here in the name of load testing.
Yes, @aslakknutsen also mentioned the same to me: the Jenkins Prometheus plugin didn't look so good. But we can try to spend some time on it if we all agree to create a time-boxed task for this.
Regarding metrics: currently, CPU, memory, etc. are already monitored by default in the Idler and the proxy. In the case of the Idler, operations like Idle and UnIdle are also monitored.
That said, the truth is that the Jenkins instance is currently not Prometheus-aware, and I am not sure if there is any way other than using the plugin.
Also, I don't know which part is responsible for monitoring boot-up time. Maybe this is on the proxy side, but I am not sure about that.
@chmouel any hints about that?
@krishnapaparaju @yzainee Added a simple test scenario which checks whether unidling happens within 5 minutes. Will build on this to create more and better scenarios. Also, as a by-product of this task, I have created https://github.com/fabric8-services/fabric8-jenkins-idler/pull/210
@krishnapaparaju Added one more scenario: https://github.com/kishansagathiya/build-load-tests/blob/master/scenario_2.go. Eventually I will move the previous scenario to Go as well.
@krishnapaparaju I am adding my work in https://github.com/kishansagathiya/build-load-tests Once done I will move this repo on https://github.com/fabric8-services or will merge this in some current repository
@krishnapaparaju OpenShift has all the builds-related data that we are looking for. We can get it via the GET builds API:
curl -k -H "Authorization: Bearer $ACCESS_TOKEN" -H 'Accept: application/json' https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/ksagathi-preview/builds | jq .items[0].metadata.annotations
Here openshift.io/jenkins-status-json is what we are interested in. It has all the details that we are looking for: what failed, what ran, when it started, when it ended, the duration of each build phase, etc. And this info is there for every build that has ever happened on OSIO.
Take a look at my jenkins status json https://github.com/kishansagathiya/build-load-tests/blob/master/jenkins-status.json
@kishansagathiya slight correction: $build-number here doesn't correlate to the build number; it is rather the index of the array. This command will return a list of all the builds of all the workspaces. We can add a filter to the above command to get the exact build status: curl -k -H "Authorization: Bearer $ACCESS_TOKEN" -H 'Accept: application/json' https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/yzainee-preview/builds | jq '.items[] | select(.metadata.name == "workSpaceName-buildNumber" )'
FYI @krishnapaparaju
@sthaha I think this has dropped out of our priorities or isn't relevant anymore. Closing this. Feel free to reopen if you think differently.
Currently we don't have an easy way to figure out which sub-component of the build system caused a failure. Let's fix this by instrumenting this functionality into the sub-components so that we can trace why/how a build failed.
This issue tracks the following points: