ballerina-platform / ballerina-library

The Ballerina Library
https://ballerina.io/learn/api-docs/ballerina/
Apache License 2.0

HTTP resources fail at a lower concurrency than expected. #6666

Closed: manuranga closed this issue 2 months ago

manuranga commented 4 months ago

Description: HTTP resources fail at a lower concurrency than expected. They fail at around 30 concurrent threads on my laptop, while a similar Java application can tolerate more than 200 (possibly even higher).

Steps to reproduce: Create a simple service:

import ballerina/http;

listener http:Listener apiListener = new(8080);

isolated service / on apiListener {
    resource isolated function get hello() returns error|json {
        return {"hello": "world"};
    }
}

Stress test with JMeter using the attached test plan: test-bal-threads.txt (rename .txt to .jmx).

Affected Versions: Ballerina 2201.9.1

Related Issues (optional): https://discord.com/channels/957996897782616114/1017493204326686860/threads/1255528309769506969
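
For reference, the "similar Java application" mentioned in the description is not attached. A comparable HTTP/1.1 baseline could be as simple as the following sketch built on the JDK's com.sun.net.httpserver; this is an assumption for illustration only, not the application that was actually benchmarked:

// Minimal Java baseline of the kind referred to in the description (assumed, not the
// exact code that was tested): serves {"hello": "world"} on http://localhost:8080/hello.
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

public class HelloServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // Use a thread pool so the server can handle many concurrent requests.
        server.setExecutor(Executors.newCachedThreadPool());
        server.createContext("/hello", exchange -> {
            byte[] body = "{\"hello\": \"world\"}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}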

manuranga commented 4 months ago

On my machine, the issue does not happen in 2201.4.3, but it does exist in 2201.5.0.

TharmiganK commented 4 months ago

The JMX script uses a stepping thread group to identify the maximum number of concurrent threads the server can handle without a timeout.

There seem to be some interruption errors on the JMeter client side. I will get a thread dump for further analysis.

TharmiganK commented 4 months ago

The issue here is the JMeter plugin client's behaviour. I saw that, regardless of the HTTP2Sampler.http1_upgrade property, the client sends a mix of plain HTTP/1.1 requests and HTTP/1.1 upgrade requests (there are no direct HTTP/2 requests; there is also a known issue in this plugin). This was verified with Wireshark.

[Screenshot: Wireshark capture of the JMeter plugin traffic, 2024-06-28 13:57:32]

In the above screenshot, the very first request comes as an HTTP/1.1 upgrade, but the next request is a plain HTTP/1.1 request. There is also an unexpected GO_AWAY from the client that closes the established HTTP/2 connection. This is with the HTTP2Sampler.http1_upgrade property set to true. When it is set to false, the very first request is plain HTTP/1.1 and the next request is an HTTP/1.1 upgrade. (I used a single user to check this.)

Due to these factors, the number of established connections keeps increasing, since the Ballerina server side does not limit the connections made by the client. As a result, an interruption exception occurs in the client used by the JMeter plugin.

Since core JMeter does not have an HTTP/2 client, it is not advisable to trust the results of these plugins without verifying their behaviour. We recommend using h2load as a benchmarking tool for HTTP/2. You can run a similar stress test using the following script:

#!/bin/bash

URL="http://localhost:8080/hello"
TOTAL_REQUESTS=100000
INITIAL_CONCURRENCY=1
MAX_CONCURRENCY=200
STEP=1

for ((c=$INITIAL_CONCURRENCY; c<=$MAX_CONCURRENCY; c+=$STEP))
do
  echo "Running with concurrency: $c"
  h2load -n $TOTAL_REQUESTS -c $c $URL
  sleep 2  # Pause between steps, adjust as necessary
done
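
For each step, the h2load summary reports the total/succeeded/failed/timeout request counts and a status-code breakdown; the concurrency level at which the failed or timeout counts first become non-zero corresponds to the breaking point the JMeter stepping thread group was trying to find.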

TharmiganK commented 4 months ago

Keeping this issue open to check the server's behaviour when we get an unexpected GO_AWAY from the client. @dilanSachi is checking on this.

manuranga commented 4 months ago

From my observation, this GO_AWAY comes after 20 s. It looks like the client is just holding the connection and closing it after a set timeout.

dilanSachi commented 4 months ago

Keeping this issue open, to check the server's behaviour when we get an unexpected GO_AWAY from the client. @dilanSachi is checking on this

At the moment, the Ballerina service does not act on the GO_AWAY from the client. The connection is kept open until the client initiates a connection close. This can be seen in the following server trace logs.

[2024-07-03 15:06:02,057] TRACE {http.tracelog.downstream} - [id: 0x24bd4c93, correlatedSource: n/a, host:/127.0.0.1:8080 - remote:/127.0.0.1:37470] ACTIVE  
[2024-07-03 15:06:02,110] TRACE {http.tracelog.downstream} - [id: 0x24bd4c93, L:/127.0.0.1:8080 - R:/127.0.0.1:37470] OUTBOUND SETTINGS: ack=false settings={INITIAL_WINDOW_SIZE=65535, MAX_HEADER_LIST_SIZE=8192}  
[2024-07-03 15:06:02,113] TRACE {http.tracelog.downstream} - [id: 0x24bd4c93, correlatedSource: n/a, host:/127.0.0.1:8080 - remote:/127.0.0.1:37470] OUTBOUND: << Payload could not be decoded >>  
[2024-07-03 15:06:02,120] TRACE {http.tracelog.downstream} - [id: 0x24bd4c93, L:/127.0.0.1:8080 - R:/127.0.0.1:37470] INBOUND SETTINGS: ack=false settings={HEADER_TABLE_SIZE=25700}  
[2024-07-03 15:06:02,122] TRACE {http.tracelog.downstream} - [id: 0x24bd4c93, L:/127.0.0.1:8080 - R:/127.0.0.1:37470] OUTBOUND SETTINGS: ack=true  
[2024-07-03 15:06:02,928] TRACE {http.tracelog.downstream} - [id: 0x24bd4c93, L:/127.0.0.1:8080 - R:/127.0.0.1:37470] INBOUND SETTINGS: ack=true  
[2024-07-03 15:06:02,934] TRACE {http.tracelog.downstream} - [id: 0x24bd4c93, L:/127.0.0.1:8080 - R:/127.0.0.1:37470] INBOUND HEADERS: streamId=3 headers=DefaultHttp2Headers[:path: /hello, :method: POST, content-type: text/plain, user-agent: wso2] padding=0 endStream=false  
[2024-07-03 15:06:03,929] TRACE {http.tracelog.downstream} - [id: 0x24bd4c93, L:/127.0.0.1:8080 - R:/127.0.0.1:37470] INBOUND DATA: streamId=3 padding=0 endStream=true length=11 data=11B
hellosworld  
Payload received hellosworld
1
[2024-07-03 15:06:03,974] TRACE {http.tracelog.downstream} - [id: 0x24bd4c93, L:/127.0.0.1:8080 - R:/127.0.0.1:37470] OUTBOUND HEADERS: streamId=3 headers=DefaultHttp2Headers[:status: 201, content-type: application/json, server: ballerina, date: Wed, 3 Jul 2024 15:06:03 +0530] padding=0 endStream=false  
[2024-07-03 15:06:03,980] TRACE {http.tracelog.downstream} - [id: 0x24bd4c93, L:/127.0.0.1:8080 - R:/127.0.0.1:37470] OUTBOUND DATA: streamId=3 padding=0 endStream=true length=17 data=17B
{"hello":"world"}  
[2024-07-03 15:06:05,930] TRACE {http.tracelog.downstream} - [id: 0x24bd4c93, L:/127.0.0.1:8080 - R:/127.0.0.1:37470] INBOUND GO_AWAY: lastStreamId=3 errorCode=11 length=0 bytes=

However, this can cause issues with misbehaving clients. @TharmiganK observed something similar with the above client: the client opens a connection, does some communication, and sends a GO_AWAY, but does not close the connection. Then the client opens another connection, does some communication, and sends a GO_AWAY again. If this happens repeatedly, there will be too many open connections that cannot do any communication, which is a waste of resources. So we discussed offline adding a feature to the Ballerina HTTP/2 server that acts upon receiving GO_AWAY frames. The proposed approach is to wait for some time for the client to close the connection (or until the existing streams complete their communication) and then close the connection from the server side.
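
For illustration only, here is a minimal sketch of what that behaviour could look like at the transport level (the listener's transport is Netty-based, as the trace logs above show). The handler name, the 30-second grace period, and the assumption that the pipeline decodes frames with Netty's Http2FrameCodec (so a GOAWAY arrives as an Http2GoAwayFrame message) are all assumptions of this sketch, not the actual implementation:

// Sketch: close a connection from the server side a grace period after the client
// sends GOAWAY, instead of keeping the stale connection open indefinitely.
// Assumes Http2FrameCodec is in the pipeline; names and timeout are illustrative.
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.handler.codec.http2.Http2GoAwayFrame;
import java.util.concurrent.TimeUnit;

public class GoAwayStaleConnectionEvictor extends SimpleChannelInboundHandler<Http2GoAwayFrame> {

    private static final long GRACE_PERIOD_SECONDS = 30; // assumed; would be configurable

    @Override
    protected void channelRead0(ChannelHandlerContext ctx, Http2GoAwayFrame goAway) {
        // The client has announced it will open no new streams on this connection.
        // Give the client (and any in-flight streams) a grace period, then close the
        // channel from the server side if the client still has not closed it.
        ctx.channel().eventLoop().schedule(() -> {
            if (ctx.channel().isActive()) {
                ctx.close();
            }
        }, GRACE_PERIOD_SECONDS, TimeUnit.SECONDS);
    }
}

A fuller implementation would also consult the number of active streams (for example, Http2Connection.numActiveStreams()) and defer the close until in-flight responses have completed, as proposed above.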

TharmiganK commented 2 months ago

Improved the behaviour by handling the GO_AWAY on the server side with a stale-connection eviction implementation.