Mapepire-IBMi / mapepire-server

Server-side support for Code for IBM i
GNU General Public License v3.0
23 stars 10 forks source link

Server Stability Problem - Server Becomes Unresponsive and High CPU Use #81

Open DavidRusso opened 4 weeks ago

DavidRusso commented 4 weeks ago

Hi,

I've been running some load/performance tests and have found that the server consistently becomes unresponsive under a certain amount of load.

I'm using the Node.js @ibm/mapepire-js client version 0.30 running on Windows and connecting to Mapepire server version 2.1.2-1 on IBM i.

Here is a simple Node.js program that can be used to reliably reproduce:

File: index.js

import "dotenv/config";
import assert from "assert/strict";
import mapepire from "@ibm/mapepire-js";

const creds = {
  host: process.env.IBMI_HOST,
  user: process.env.IBMI_USER,
  password: process.env.IBMI_PASSWORD,
  ignoreUnauthorized: true
};

const poolOpts = {
  creds,
  "maxSize": 20,
  "startingSize": 20
};

const pool = new mapepire.Pool(poolOpts);
await pool.init();

const DURATION_SECONDS = 120;
const QUERIES_PER_SECOND = 10;

const all = [];
for (let i = 0; i < DURATION_SECONDS; i++) {
  const queries = [];
  for (let q = 0; q < QUERIES_PER_SECOND; q++) {
    queries.push(new Promise((resolve, reject) => {
      const query = pool.query("select * from qsys2.systables");
      query.execute(500).then(result => {
        assert(result.data.length > 0);
        query.close().then(() => {
          resolve();
        });
      }).catch(reject);
    }));
  }
  await new Promise((resolve) => setTimeout(resolve, 1000));
  all.push(...queries);
  console.log(`Total queries initiated: ${all.length}`);
}

await Promise.all(all);
pool.end();

console.log(`${all.length} queries completed`);

File: package.json

{
  "name": "mape",
  "version": "1.0.0",
  "main": "index.js",
  "type": "module",
  "scripts": {
    "start": "node ."
  },
  "author": "",
  "license": "ISC",
  "description": "",
  "dependencies": {
    "@ibm/mapepire-js": "^0.3.0",
    "dotenv": "^16.4.5"
  }
}

To use:

The problems usually start after about 750 queries are initiated:

This should reproduce the first/every time. If not, try increasing the DURATION_SECONDS value.

Interestingly, the problem does not occur with DURATION_SECONDS set to 45 -- all queries complete quickly/normally and server is in normal state.

My QZDASOINIT prestart job entry in QUSRWRK subsystem is configured like this:

                      Display Prestart Job Entry Detail                   
                                                            System:   
Subsystem description:   QUSRWRK        Status:   ACTIVE                  

Program  . . . . . . . . . . . . . . . . . . . . :   QZDASOINIT           
  Library  . . . . . . . . . . . . . . . . . . . :     QSYS               
User profile . . . . . . . . . . . . . . . . . . :   QUSER                
Job  . . . . . . . . . . . . . . . . . . . . . . :   QZDASOINIT           
Job description  . . . . . . . . . . . . . . . . :   QDFTSVR              
  Library  . . . . . . . . . . . . . . . . . . . :     QGPL               
Start jobs . . . . . . . . . . . . . . . . . . . :   *YES                 
Initial number of jobs . . . . . . . . . . . . . :   20                   
Threshold  . . . . . . . . . . . . . . . . . . . :   20                   
Additional number of jobs  . . . . . . . . . . . :   20                   
Maximum number of jobs . . . . . . . . . . . . . :   *NOMAX               
Maximum number of uses . . . . . . . . . . . . . :   200                  
Wait for job . . . . . . . . . . . . . . . . . . :   *YES                 
Pool identifier  . . . . . . . . . . . . . . . . :   1                    
jonnyz32 commented 4 weeks ago

@DavidRusso Did you capture any server logs?

DavidRusso commented 4 weeks ago

@jonnyz32 , Log files in ~/.sc/logs are entirely empty. Also nothing in MAPEPIRE server job logs. No activity or messages in QZDASOINIT pool jobs.

worksofliam commented 4 weeks ago

@DavidRusso I am setting this up across a few systems I have access to. Out of curiosity, do you find your test script passing if you use a select statement that returns a smaller result set?

Perhaps something like select * from qsys2.systables limit 50. This will help me determine where this might be happening in the server. Thanks!

DavidRusso commented 4 weeks ago

@worksofliam ,

Looks like you're on the right track. I found that If I add limit 100 to the select statement and also remove the 500 parameter from the query.execute() call, the problem does not occur.

worksofliam commented 4 weeks ago

@DavidRusso Another 'out of curiosity', have you tried running Mapepire-server locally with your test case? I am recreating your problem on the server, and I am able to start debugging. Just wondering if you tried this at all? No worries if not.

DavidRusso commented 4 weeks ago

@worksofliam , no, I have not tried running locally. I've only tried the packaged server version from the IBM i yum repos.

worksofliam commented 3 weeks ago

I have been debugging this a little bit with the server running on my local machine.

I see this in my logger:

Total queries initiated: 1200

This line of code gets hit:

await Promise.all(all);

And from there is truly just waiting for all 1200 statements to be returned. I see a lot of threads on the Java end (perhaps too many, looking at this) and also see the client waiting for all the responses.

More info coming soon

worksofliam commented 3 weeks ago

For server recreation:

Input query to the server that is causing the hang:

{"id":"query51","type":"sql","sql":"select * from qsys2.systables","rows":500}

Though it does produce a valid response:

{"id":"query51","has_results":true,"update_count":-1,"metadata":{...},"data":[...],"is_done":false,"success":true}
worksofliam commented 3 weeks ago

I think I've tracked it down to the post-processing of the result set.

If you use execute(100) then it's significantly faster. I have spent some debugging and I've tracked it down to this method:

https://github.com/Mapepire-IBMi/mapepire-server/blob/dc7bbbd68ca8249c4195ab02d084d6a4eba867b8/src/main/java/com/github/ibm/mapepire/requests/BlockRetrievableRequest.java#L103

I will post back with more details with a plan.

bashdi commented 2 days ago

@worksofliam i think the server is running out of memory. The jvm-settings in the mapepire-start.sh sets the heap size to 1G:

-Xsoftmx1G -Xmx16g -Xms256M

-Xsoftmx

The js-script defintly makes my, limited to 1G and on my windows machine running, mapepire-server throw OutOfMemoryError-Exceptions and unresponsive.

image

After aprox. 990 queries initiated: image image

I couldnt find a use of the api, which lets the server change the "soft-memory-limit", in the server code. I think we can remove it?