ballerina-platform / ballerina-lang

The Ballerina Programming Language
https://ballerina.io/
Apache License 2.0

High error rate for HTTPS load tests with 200 concurrent users #42628

xlight05 closed this issue 2 weeks ago

xlight05 commented 3 weeks ago

Description: See the title. The 60-user case works fine. This applies to both our passthrough and transformation use cases.

https://github.com/ballerina-platform/ballerina-performance-cloud/actions/runs/8782386068/job/24096556961
https://github.com/ballerina-platform/ballerina-performance-cloud/blob/1d735a28d62a06265a7ccad3c72bfa78764b476a/load-tests/https_passthrough/results/summary.csv#L2609

TharmiganK commented 3 weeks ago

I tried running the existing load tests with 200 concurrent requests using the workflow.

Workflow run: https://github.com/ballerina-platform/ballerina-performance-cloud/actions/runs/8794152427
Results: https://github.com/ballerina-platform/module-ballerina-http/pull/1964/files

There were some significant differences between the code used in the https_passthrough load-test and the code used in the h1_h1_passthrough load-test.

  1. The one in ballerina-performance-cloud uses an h2-h2 approach, whereas the one in the http module uses h1-h1
  2. The one in ballerina-performance-cloud uses http:Caller to respond, whereas the one in the http module just returns the http:Response
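The second difference can be illustrated with a minimal sketch. This is not the code of either load test; the resource names (viaCaller, viaReturn), the client name backendEP, the ports, and the backend path are assumptions chosen for illustration only.

```ballerina
import ballerina/http;

// Assumed backend URL; not taken from either load test.
final http:Client backendEP = check new ("http://localhost:8688");

service /passthrough on new http:Listener(9090) {

    // Style A: respond explicitly through http:Caller.
    resource function post viaCaller(http:Caller caller, http:Request req) returns error? {
        http:Response res = check backendEP->forward("/service/EchoService", req);
        check caller->respond(res);
    }

    // Style B: return the http:Response and let the listener send it.
    resource function post viaReturn(http:Request req) returns http:Response|error {
        http:Response res = check backendEP->forward("/service/EchoService", req);
        return res;
    }
}
```

Both resources forward the inbound request to the backend; they differ only in how the response is handed back to the client.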

So I added two more load tests, h2_h2_passthrough and h2_transformation, but I am still getting a 0% error rate:

Workflow run: https://github.com/ballerina-platform/ballerina-performance-cloud/actions/runs/8797283914
Results: https://github.com/ballerina-platform/module-ballerina-http/pull/1963/files

I also tried to reproduce this issue locally using the code in https_passthrough, running the load test with 200 concurrent users for 5 minutes, but I could not reproduce it.

@xlight05 Can we check the configurations used to run these load tests?

xlight05 commented 3 weeks ago

We had an offline chat on this. We were able to get a strand dump when the issue was reproduced.

Strand dump - https://gist.github.com/xlight05/9ef16bbe1ea7f733d43a398429920a32

TharmiganK commented 3 weeks ago

I was able to reproduce this issue with the help of @xlight05 in a constrained environment. The steps are as follows:

  1. Clone the following repo: https://github.com/xlight05/bal_https_hello
  2. Run bal build
  3. Run docker-compose up
  4. Run the load test using this JMX file: https://gist.github.com/TharmiganK/8f78d8a3ec820661c4fdab7ee723ad7e/

Please note that this issue is only reproducible when you make multiple requests at short intervals. Strangely, if we first make a single request and wait for its response, the subsequent requests pass.
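To make "multiple requests at short intervals" concrete, here is a rough sketch of a burst client. It is not the JMX test plan linked above and may not generate enough load to trigger the hang. The target URL, the /passthrough path, the payload, the burst size of 50, and the helper name sendOnce are all assumptions.

```ballerina
import ballerina/http;
import ballerina/io;

// Hypothetical helper: sends one POST and reports whether it returned 200.
function sendOnce(http:Client target) returns boolean {
    http:Response|error res = target->post("/passthrough", "hello");
    return res is http:Response && res.statusCode == 200;
}

public function main() returns error? {
    // Assumed address of the service under test.
    http:Client target = check new ("http://localhost:9090");

    // Start all requests without waiting for any response first.
    future<boolean>[] pending = [];
    foreach int i in 0 ..< 50 {
        pending.push(start sendOnce(target));
    }

    // Collect the results once every request has been issued.
    int successes = 0;
    foreach future<boolean> f in pending {
        boolean|error ok = wait f;
        if ok is boolean && ok {
            successes += 1;
        }
    }
    io:println("successful requests: ", successes, " of ", pending.length());
}
```

For comparison with the warm-up observation, calling sendOnce(target) once before the loop would issue a single request and wait for its response before the burst starts.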

I have checked the following:

  1. With update 9 - Failing
  2. With update 9 and without http changes - Failing
  3. With update 8 and new http changes - Passing

So it seems the issue comes from the lang changes in update 9. Adding @HindujaB and @gabilang to check on this.

TharmiganK commented 3 weeks ago

I was able to reduce the reproducer to the following code (no need for Docker; just use bal run):

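import ballerina/http;

// Listener for the passthrough service (SSL is not configured in this reduced reproducer).
listener http:Listener securedEP = new (9090);

// Client for the backend echo service.
final http:Client nettyEP = check new ("http://localhost:8688");

service /passthrough on securedEP {
    // Forwards the inbound request to the backend and returns its response.
    resource function post .(http:Request clientRequest) returns http:Response|error {
        return nettyEP->/'service/EchoService.post(clientRequest);
    }
}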

But in order to reproduce it, I have to use 1000 users with a 5s ramp-up period. (I checked a similar configuration with the update 8 service and it worked without any hanging.)

If I remove clientRequest from the resource signature, then it works without any hanging. So this might be related to the previous memory issue: https://github.com/ballerina-platform/ballerina-lang/issues/42566. The difference here is that there is no memory increase now, but some strands used to populate the default values seem to be in a runnable state.

Please note that I have removed SSL here, so I am not 100% sure that the two are related. (The service also hangs with SSL.) But I think the probability of this issue occurring is higher with SSL.

When the service hangs, most of the jbal threads are in the monitor state: [image]

Strand dump: https://gist.github.com/TharmiganK/932d0274a391aa55f8fbe9e9da5135a1
Thread dump: https://drive.google.com/file/d/14Y4x7b5Vdm-8RCT_VyLTQSC8sSsqSfbT/view?usp=drive_link

github-actions[bot] commented 2 weeks ago

This issue is NOT closed with a proper Reason/ label. Make sure to add proper reason label before closing. Please add or leave a comment with the proper reason label now.

      - Reason/EngineeringMistake - The issue occurred due to a mistake made in the past.
      - Reason/Regression - The issue has introduced a regression.
      - Reason/MultipleComponentInteraction - Issue occurred due to interactions in multiple components.
      - Reason/Complex - Issue occurred due to complex scenario.
      - Reason/Invalid - Issue is invalid.
      - Reason/Other - None of the above cases.