Performance feature not easily able to test a stressed microservice

NikCanvin commented 4 years ago

Codewind version: 0.5.0
OS: macOS
Che version: na
IDE extension version: 0.5.0
IDE version: VSCode
Kubernetes cluster: na

Description: As part of a perf blog in progress here: https://github.ibm.com/dev-ex/devAdvocacy/issues/147 ... I have spent 2 days trying to use the PerformanceDashboard to demo a perf enhancement in Node.js v13 (over v12). When I load up the microservice, not all the requests hit the microservice as expected; instead, loadrunner buffers the requests until the microservice is free to process them. The microservice response time is technically correct; however, I'd expect all the requests to be fired at the microservice per my parameters (in the edit load settings) and the actual response time to be much longer. I think this is a bug.

I suspect we also need to add new features:

to stop the test after the duration parameter has been reached, to prevent long waits for the responses before the test finishes.
an even better feature, though, would be if Codewind performance testing could 'ramp' up the load until a crash, or until the response time hits a threshold, to help the developer find the maximum limit of the current microservice. At the moment the developer has to experiment to find the limits, and that can take hours or days.

Steps to reproduce:

Workaround:

NikCanvin commented 4 years ago

I think this is a bug because the performance feature promises to help the user detect performance changes between code changes. However, the data is broken due to Codewind's architectural behaviour/design, so the feature does not do what the user expects. I realise, though, that it's a big change, so we consider it more 'feature-like' to address.

NikCanvin commented 4 years ago

To @tobespc .. .. following 0.7.0 release, with a new metrics architecture ... I retested the performance feature.

Unfortunately, the same issue is still seen: Screenshot 2019-12-20 at 15 52 51

Identical runs cause the performance to degrade (despite no code changes). An app restart restores the performance, until multiple further runs cause it to degrade again!

markcor11 commented 4 years ago

The metrics collections are sampled and recorded by AppMetrics which runs within the project container. Codewind PFE tells the project to start and stop collecting metrics using a few different endpoints :

START_RECORDING : POST http://127.0.0.1:8000/appmetrics/api/v1/collections (which returns a collection ID)
STOP_RECORDING : GET http://127.0.0.1:8000/appmetrics/api/v1/{connectionID}
DELETE_RECORDING: DELETE http://127.0.0.1:8000/appmetrics/api/v1/{connectionID}

There should be one collection per load run which will include summary data similar to :

{
    "id": 2,
    "time": {
        "data": {
            "start": 1580131139857,
            "end": 1580131165304
        },
        "units": {
            "start": "UNIX time (ms)",
            "end": "UNIX time (ms)"
        }
    },
    "cpu": {
        "data": {
            "systemMean": 0.0429953625,
            "systemPeak": 0.0532637,
            "processMean": 0.0016209041249999998,
            "processPeak": 0.0037047
        },
      ......
}

Each summary contains a start timestamp and an end timestamp; the difference between them is the duration of the recording, from when recording started to when it ended.

"start": 1580131139857, "end": 1580131165304

Ideally, load is run against the project while the metrics are being recorded. Currently Codewind PFE applies load pressure based on project-specific properties (for example, for 60 seconds), and that should coincide with the duration of the metrics recording, e.g.:

Start recording metrics
Start load test time for 60 seconds
Apply load
Apply load
Apply load
Apply load
Apply load
Apply load
Load timer expires
Stop recording metrics

You might expect the metrics to be collected for only 60 seconds; however, that may not always be the case, and I think that's part of the bug.

What is actually happening is :

Start recording metrics
Start load test time for 60 seconds
Apply load
Apply load
Apply load
Apply load
Apply load
Apply load
Load timer expires
Apply load                           <-- lots of outstanding inflight load requests
Stop recording metrics

Under extreme load, the project is still busy trying to handle and respond to the inflight requests from loadrunner. It may not process the request to turn off the metrics recorder until some time after the load run has finished. That means we cannot guarantee that the metrics summary covers only the requests captured during a specific time window, since some project URL requests may continue to stream into the container and be measured until the collection STOP is received.
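The effect described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not taken from Codewind itself: the project is modelled as a single worker that handles requests strictly in arrival order, every load request costs the same service time, and the function name and parameters are hypothetical.

```python
from collections import deque


def simulate_stop_time(load_seconds, requests_per_second, service_time):
    """Return the virtual time (in seconds) at which a single-worker project
    gets round to the STOP request, assuming strict FIFO handling."""
    queue = deque()
    # Load requests arrive evenly spaced while the load timer is running.
    total = int(load_seconds * requests_per_second)
    for i in range(total):
        queue.append(("load", i / requests_per_second))
    # The STOP request is sent at the moment the load timer expires.
    queue.append(("stop", float(load_seconds)))

    clock = 0.0
    while queue:
        kind, arrival = queue.popleft()
        # A request starts once the worker is free and the request has arrived.
        started = max(clock, arrival)
        if kind == "stop":
            return started
        clock = started + service_time
    raise RuntimeError("STOP request was never queued")


# The project keeps up: recording stops when the 60 second timer expires.
print(simulate_stop_time(60, 10, 0.05))
# The project is overloaded: STOP waits behind the inflight requests.
print(simulate_stop_time(60, 10, 0.2))
```

In the overloaded case the recording window is roughly twice the intended 60 seconds, because the STOP request is queued behind work that was accepted before the load timer expired.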

One way around this would be to have AppMetrics turn the metrics collection ON and OFF within the project container itself, rather than being told to. If AppMetrics started a timer at the point where recording started, it would keep recording until that timer expired, regardless of load ending late. When Codewind then asks for the metrics, it would retrieve the recorded snapshot for just that time window. This would be a change to AppMetrics, JavaMetrics and SwiftMetrics, but it would get us closer to measuring load for x minutes, which we currently do not do.
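A minimal sketch of that idea follows. It is not the AppMetrics implementation: the class name, its methods and the injectable clock are hypothetical, and it only shows how a collection could close its own window instead of waiting for an external STOP.

```python
import time


class TimedCollection:
    """Hypothetical collection that closes its own window after duration_s seconds."""

    def __init__(self, duration_s, clock=time.monotonic):
        self._clock = clock
        self.start = clock()
        self.end = self.start + duration_s
        self.samples = []

    def record(self, value):
        """Store a sample only if it arrives inside the timed window."""
        if self._clock() <= self.end:
            self.samples.append(value)
            return True
        return False

    def summary(self):
        """Summarise only what was captured inside the window."""
        count = len(self.samples)
        mean = sum(self.samples) / count if count else 0.0
        return {"start": self.start, "end": self.end, "count": count, "mean": mean}


# Usage with a fake clock so that the example runs instantly.
now = [0.0]
collection = TimedCollection(60, clock=lambda: now[0])
now[0] = 10.0
collection.record(100)   # inside the window, kept
now[0] = 75.0
collection.record(900)   # late inflight request, ignored
print(collection.summary())
```

With this shape, a late request for the summary still returns data for the fixed window only, which is the property the proposal is after.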

I'm not saying that will solve the entire issue about worsening performance after each load run, but we have to at least get to a point where the expected run duration is consistent and not left open to outside influences.

markcor11 commented 4 years ago

Ran a test script to check behaviour based on above :

Start Recording metrics
Apply load for 10 seconds
Sleep 15 seconds (simulate external requests still arriving)
Turn off recording

results :

{"id":3,"time":{"data":{"start":1580135082273,"end":1580135107995},.....

Duration = 1580135107995 - 1580135082273 = 25.7 seconds

The 25.7 seconds includes the 15 seconds of sleep, during which any number of requests could still arrive in the project and skew the summaries.
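The same check can be scripted. The sketch below uses hypothetical helper names and reproduces only the time fields from the result above; the rest of that summary was elided in the original and is left out here.

```python
import json


def recording_duration_ms(summary):
    """Duration of a collection, from its start and end UNIX timestamps (ms)."""
    times = summary["time"]["data"]
    return times["end"] - times["start"]


def overrun_ms(summary, expected_load_ms):
    """How much longer the recording ran than the intended load window."""
    return max(0, recording_duration_ms(summary) - expected_load_ms)


# Only the time fields of the test result are included.
summary = json.loads(
    '{"id": 3, "time": {"data": {"start": 1580135082273, "end": 1580135107995}}}'
)

print(recording_duration_ms(summary))  # 25722
print(overrun_ms(summary, 10_000))     # 15722
```

An overrun of about 15.7 seconds against a 10 second load run matches the 15 second sleep in the test script.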

markcor11 commented 4 years ago

Made some changes to AppMetrics, and the timed collection now expires within a few milliseconds of the end of a timed capture.

Work completed so far :

[x] Appmetrics-Codewind : https://github.com/markcor11/appmetrics-codewind/tree/954-TimedMetricsCollection
[x] Javametrics/Codewind : https://github.com/markcor11/javametrics/tree/954-TimedMetricsCollection
[ ] Wire up PFE to use the 2 new metrics collection APIs

markcor11 commented 4 years ago

New endpoints for controlling metrics :

For Node.js, metricsName = appmetrics. For Java, metricsName = javametrics.

list running collections : GET http://{project}/{metricsName}/api/v1/collections
start running a collection : POST http://{project}/{metricsName}/api/v1/collections/{timeInSeconds}
retrieve and delete saved collection : GET http://{project}/{metricsName}/api/v1/collections/{collectionID}/stashed
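As a rough illustration of how a caller might build these URLs, here is a small sketch. The helper names are hypothetical, the host 127.0.0.1:8000 is borrowed from the earlier comment purely as an example, and no request is actually sent.

```python
from urllib.parse import quote

# Metrics path segment per language, as listed above.
METRICS_NAMES = {"node": "appmetrics", "java": "javametrics"}


def collections_url(project, language):
    """URL that lists collections (GET)."""
    return f"http://{project}/{METRICS_NAMES[language]}/api/v1/collections"


def start_timed_collection_url(project, language, time_in_seconds):
    """URL that starts a collection for a fixed number of seconds (POST)."""
    return f"{collections_url(project, language)}/{int(time_in_seconds)}"


def stashed_collection_url(project, language, collection_id):
    """URL that retrieves and deletes a saved collection (GET)."""
    safe_id = quote(str(collection_id), safe="")
    return f"{collections_url(project, language)}/{safe_id}/stashed"


print(start_timed_collection_url("127.0.0.1:8000", "node", 60))
print(stashed_collection_url("127.0.0.1:8000", "java", 3))
```

The intended flow, as far as the thread describes it, would be to POST the timed start, apply load, and then GET the stashed collection once the timer has expired.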

markcor11 commented 4 years ago

/close

eclipse-archived / codewind

Performance feature not easily able to test a stressed microservice #954