alexandrainst / node-red-contrib-postgresql

Node-RED node for PostgreSQL, supporting parameters, split, back-pressure
https://flows.nodered.org/node/node-red-contrib-postgresql
Apache License 2.0

Server connection pool getting filled by idle "wait: ClientRead" connections when sending queries rapidly #21

Closed: robacarp closed this issue 2 years ago

robacarp commented 2 years ago

Hey, thanks for this Node-RED node. I was able to get it up and running pretty quickly, but I'm running into a small issue. Simple INSERT queries hang when they are fired off too fast. This fills up the connection pool on the server, which then causes later queries to fail with "Error: timeout exceeded when trying to connect".

A workaround is to add a rate-limit delay node, which tells me that something is getting forgotten when more than one query is being run at a time. I believe this would be insufficient if I had many query nodes in my flows because any collision would leave a connection hanging and eventually lead to pool exhaustion.

I'm not doing anything too clever. Postgres is installed and connected over 127.0.0.1, and it's a clean install of Postgres 14. No other applications are connecting at this time -- I have shut down grafana, which is the only other thing connected to the server.

My query is this:

INSERT INTO "public"."metrics"
("name", "metric", "units", "type")
VALUES('{{{ msg.topic }}}', {{{ msg.payload.value }}}, 'ºF', '{{{ msg.payload.name }}}')

It gets expanded correctly and is received by Postgres like this, which I pulled from the query log:

INSERT INTO "public"."metrics"
("name", "metric", "units", "type")
VALUES('Sensor 2', 39.46, 'ºF', 'temperature')
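
As an aside, the same insert can also be expressed without Mustache interpolation by using the node's parameterized-query support, with the values passed separately from the SQL text. The following is only a minimal sketch of a function node feeding the query node this way; it assumes the node reads the SQL from msg.query (as mentioned later in this thread) and binds an array in msg.params to the $1, $2, ... placeholders (verify the exact property name in the node's README):

// Hypothetical function node placed in front of the postgresql node.
// Assumption: the node accepts the SQL text in msg.query and the bound
// values as an array in msg.params, matched to $1, $2, $3 in order.
msg.query = 'INSERT INTO "public"."metrics" ("name", "metric", "units", "type") ' +
            "VALUES ($1, $2, 'ºF', $3)";
msg.params = [msg.topic, msg.payload.value, msg.payload.name];
return msg;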

The first query in the batch executes fine and its data ends up in the table, but the remaining n-1 queries never finish and their connections are left hanging open:

> psql
psql (14.0)
Type "help" for help.

robert=# select client_addr, wait_event, state, query from pg_stat_activity;
 client_addr |     wait_event      | state  |                                                               query
-------------+---------------------+--------+-----------------------------------------------------------------------------------------------------------------------------------
             | AutoVacuumMain      |        | 
             | LogicalLauncherMain |        | 
             |                     | active | select client_addr, wait_event, state, query from pg_stat_activity;
 127.0.0.1   | ClientRead          | active | INSERT INTO "public"."metrics" ("name", "metric", "units", "type") VALUES('Sensor 1', 60.35, 'ºF', 'temperature')
 127.0.0.1   | ClientRead          | idle   | INSERT INTO "public"."metrics" ("name", "metric", "units", "type") VALUES('Sensor 2', 39.46, 'ºF', 'temperature')
 127.0.0.1   | ClientRead          | active | INSERT INTO "public"."metrics" ("name", "metric", "units", "type") VALUES('Sensor 2', 39.46, 'ºF', 'temperature')
 127.0.0.1   | ClientRead          | idle   | INSERT INTO "public"."metrics" ("name", "metric", "units", "type") VALUES('Sensor 2', 39.46, 'ºF', 'temperature')
             | BgWriterMain        |        | 
             | CheckpointerMain    |        | 
             | WalWriterMain       |        | 
(10 rows)

These connections listed here will never close -- triggering my flow again fires 5 queries and 4 of them will never correctly finish.

Here are some screenshots:

[Screenshot: the Node-RED flow]
[Screenshot: a working Node-RED flow]
Alkarex commented 2 years ago

Hello, I will try to see if there is anything in the node itself that could be improved. But in any case, there are multiple ways to improve performance. An important one is to bundle several rows into a single INSERT: instead of making one request per row you want to insert, put several rows into the same INSERT statement. In your case, put at least the 5 sensors into the same request, like so:

INSERT INTO "public"."metrics" ("name", "metric", "units", "type")
VALUES
('Sensor 1', 39.46, 'ºF', 'temperature'),
('Sensor 2', 39.46, 'ºF', 'temperature'),
('Sensor 3', 39.46, 'ºF', 'temperature'),
('Sensor 4', 39.46, 'ºF', 'temperature'),
('Sensor 5', 39.46, 'ºF', 'temperature')
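
As a rough illustration (not part of the original exchange), a join node that collects a handful of sensor messages into an array could feed a function node like the one below, which assembles such a multi-row INSERT. The message shape (topic, payload.value, payload.name) follows the query shown earlier in this thread; msg.query is the property mentioned later in the thread, while msg.params and the $n placeholder binding are assumptions about the node's parameterized-query support:

// Hypothetical function node placed after a join node whose output is an
// array of the original sensor messages in msg.payload.
const readings = msg.payload;
const params = [];
const tuples = readings.map((m, i) => {
    const base = i * 3;
    params.push(m.topic, m.payload.value, m.payload.name);
    return `($${base + 1}, $${base + 2}, 'ºF', $${base + 3})`;
});
msg.query = 'INSERT INTO "public"."metrics" ("name", "metric", "units", "type") VALUES '
    + tuples.join(', ');
msg.params = params;
return msg;
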
robacarp commented 2 years ago

@Alkarex thanks for looking into it. In some circumstances I'm able to batch inserts, but not always. In the flow I attached, each sensor sends data whenever the temperature has changed. Sometimes they will change at the same moment but most often they will change at unrelated moments.

I started noticing this hung-connection problem because the sensors would log correctly for a day or two and then suddenly all logging would stop and the Node-RED console would fill up with "Error: timeout exceeded when trying to connect" messages. Restarting Node-RED did not fix the problem; I had to restart Postgres in order to clear the hung connections. I was lucky to be able to reproduce the problem by firing so many inserts at once.

ahanselka commented 2 years ago

I'm having a similar issue to @robacarp, except mine seems to be getting caught up with SELECT statements. It doesn't happen super often or cause me huge problems, but I figured I'd note that it can happen with lots of SELECT as well. 🙂

Alkarex commented 2 years ago

If you have a scenario in which there is a burst of many requests within a short time period, you can increase the Connection Timeout parameter (in milliseconds):

[Screenshot: node configuration showing the Connection Timeout field]

How large it needs to be of course depends on the number of requests and on how long PostgreSQL takes to execute them.

So for safely processing a large number of requests at maximum speed, I advise using the built-in backpressure mechanism whenever possible.

ahanselka commented 2 years ago

@robacarp I started having the same issue with INSERTs. What I did was use the batch and join nodes to aggregate the readings over 10-second intervals, then use a function node with knex to generate a msg.query for the PostgreSQL node. It helped substantially, given that I had gotten up to several inserts per second. Now I end up doing one insert every 10 or so seconds, with 14 or so records each time, and have not experienced any further issues.
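
For reference, here is a rough sketch of that kind of function node. It is not the actual flow: it assumes a batch + join pair upstream delivers the buffered messages as an array in msg.payload, and that the knex module has been made available to the function node (for example via the function node's module setup), exposed under the name knex:

// Sketch only: build one multi-row INSERT from the buffered readings.
const qb = knex({ client: 'pg' });  // used purely as a SQL builder; it never connects here

const rows = msg.payload.map(m => ({
    name: m.topic,
    metric: m.payload.value,
    units: 'ºF',
    type: m.payload.name,
}));

// toString() inlines the values, so the result can go straight into msg.query
// for the postgresql node, giving one round trip per batch.
msg.query = qb('metrics').insert(rows).toString();  // table name only; add the schema if needed
return msg;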

wangnick commented 2 years ago

Similar issue over here. Under some conditions node-red-contrib-postgresql gets stuck with dangling queries, and any further request is rejected with "Error: timeout exceeded when trying to connect". In my case the connection pool settings are the defaults (pool size 10, idle timeout 1000 ms, connection timeout 10000 ms), and the backlog when stuck looks like this, for example:

 127.0.0.1     | ClientRead | idle   | select extract(year from CURRENT_DATE) as year, extract(month from CURRENT_DATE) as month, extract(day from CURRENT_DATE) as day, min(ts), max(ts), count(*) as count, sum(case when soil_status='WET' then 1 else 0 end) as wet, sum(case when soil_status='DRY' then 1 else 0 end) as dry from t5_valvebox where no=6 and ts>=CURRENT_DATE+interval '6 hour' and ts<CURRENT_DATE+interval '7 hour'
 127.0.0.1     | ClientRead | idle   | select count(*) as count, percentile_disc(0.2) within group (order by pond_mm) as pond_mm  FROM t5_valvebox where ts>=now()-interval '3 hour' and no=6
 127.0.0.1     | ClientRead | idle   | select extract(year from CURRENT_DATE) as year, extract(month from CURRENT_DATE) as month, extract(day from CURRENT_DATE) as day, min(ts), max(ts), count(*) as count, sum(case when soil_status='WET' then 1 else 0 end) as wet, sum(case when soil_status='DRY' then 1 else 0 end) as dry from t5_valvebox where no=6 and ts>=CURRENT_DATE+interval '6 hour' and ts<CURRENT_DATE+interval '7 hour'
 127.0.0.1     | ClientRead | idle   | select extract(year from CURRENT_DATE) as year, extract(month from CURRENT_DATE) as month, extract(day from CURRENT_DATE) as day, min(ts), max(ts), count(*) as count, sum(case when soil_status='WET' then 1 else 0 end) as wet, sum(case when soil_status='DRY' then 1 else 0 end) as dry from t5_valvebox where no=6 and ts>=CURRENT_DATE+interval '6 hour' and ts<CURRENT_DATE+interval '7 hour'
 127.0.0.1     | ClientRead | idle   | select count(*) as count, percentile_disc(0.2) within group (order by pond_mm) as pond_mm  FROM t5_valvebox where ts>=now()-interval '3 hour' and no=6
 127.0.0.1     | ClientRead | idle   | select extract(year from CURRENT_DATE) as year, extract(month from CURRENT_DATE) as month, extract(day from CURRENT_DATE) as day, min(ts), max(ts), count(*) as count, sum(case when soil_status='WET' then 1 else 0 end) as wet, sum(case when soil_status='DRY' then 1 else 0 end) as dry from t5_valvebox where no=6 and ts>=CURRENT_DATE+interval '6 hour' and ts<CURRENT_DATE+interval '7 hour'
 127.0.0.1     | ClientRead | idle   | select count(*) as count, percentile_disc(0.2) within group (order by pond_mm) as pond_mm  FROM t5_valvebox where ts>=now()-interval '3 hour' and no=6
 127.0.0.1     | ClientRead | active | select extract(year from CURRENT_DATE) as year, extract(month from CURRENT_DATE) as month, extract(day from CURRENT_DATE) as day, min(ts), max(ts), count(*) as count, sum(case when soil_status='WET' then 1 else 0 end) as wet, sum(case when soil_status='DRY' then 1 else 0 end) as dry from t5_valvebox where no=6 and ts>=CURRENT_DATE+interval '6 hour' and ts<CURRENT_DATE+interval '7 hour'
 127.0.0.1     | ClientRead | active | select count(*) as count, percentile_disc(0.2) within group (order by pond_mm) as pond_mm  FROM t5_valvebox where ts>=now()-interval '3 hour' and no=6
 127.0.0.1     | ClientRead | idle   | select extract(year from CURRENT_DATE) as year, extract(month from CURRENT_DATE) as month, extract(day from CURRENT_DATE) as day, min(ts), max(ts), count(*) as count, sum(case when soil_status='WET' then 1 else 0 end) as wet, sum(case when soil_status='DRY' then 1 else 0 end) as dry from t5_valvebox where no=6 and ts>=CURRENT_DATE+interval '6 hour' and ts<CURRENT_DATE+interval '7 hour'

It seems as if the connection pool does not get cleaned up properly under certain conditions ...

wangnick commented 2 years ago

> It seems as if the connection pool does not get cleaned up properly under certain conditions ...

My bad. For exactly those nodes firing the queries listed above as dangling, I had activated "Split results in multiple messages" to ease my post-processing of the single row returned, not recognising that this also activates backpressure management. So the node was apparently waiting to receive a tick before it would return the connection to the pool, even though no further rows were available. Sorry for the noise ...

Kind regards, Sebastian

Alkarex commented 2 years ago

@wangnick Thanks for the update 👍🏻 Let me know if you spot another issue

Alkarex commented 2 years ago

A performance issue has been fixed by https://github.com/alexandrainst/node-red-contrib-postgresql/pull/25, which might also positively impact the problems discussed here.