citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0

Real-Time select queries might get stuck and never finish under high load #1799

Closed: onderkalaci closed this issue 4 years ago

onderkalaci commented 6 years ago

Steps to reproduce (it happens almost every time):

CREATE TYPE complex AS (
    r       double precision,
    i       double precision
);

SELECT run_command_on_workers('
CREATE TYPE complex AS (
    r       double precision,
    i       double precision
);');

CREATE TABLE test_table_1 
    (key int, 
     occurred_at timestamp DEFAULT now(), 
     value_1 jsonb, 
     value_2 text[], 
     value_3 int4range, 
     value_4 complex NOT NULL);
SELECT create_distributed_table('test_table_1', 'key');

fab pg.set_config:max_wal_size,"'5GB'"
fab pg.set_config:max_connections,1000
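
For anyone not using the test-automation fab tasks, roughly equivalent settings can be applied directly on each node; a minimal sketch (assumes superuser access, and note that max_connections only takes effect after a restart):

psql -c "ALTER SYSTEM SET max_wal_size = '5GB';"
psql -c "ALTER SYSTEM SET max_connections = 1000;"
psql -c "SELECT pg_reload_conf();"   # reloads max_wal_size; max_connections still needs a restart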

Load some data with pgbench:

\set aid  random(1, 100000)
\set bid  random(1, 100000)
\set cid  random(1, 100000)
\set did  random(1, 100000)

INSERT INTO test_table_1 (key, value_1, value_2, value_3, value_4) VALUES 
                         (:aid,
                          row_to_json(row(:aid,:bid, row(:cid+10, :did+500,:aid *7), row(:cid-10, :did-500,:aid *71))),
                          ARRAY[:aid::text, :bid::text, :cid::text, :did::text, 'Onder Kalaci', 'CitusData Rocks']::text[],
                          int4range(:aid, :aid + :bid),
                            (:aid, :did)::complex);

Then the pgbench command:

pgbench -f insert_only.sql -P 3 -c128 -j512 -t 1000 -n

Real-time select pgbench file:

SELECT * FROM test_table_1 LIMIT 10;

pgbench command: pgbench -f select.sql -P 3 -c32 -j32 -t 512 -n

At some point, some of the clients block with the following backtrace:

#0  0x00007f4aa73bc9d0 in __poll_nocancel () from /lib64/libc.so.6
#1  0x00007f4a9f74dd77 in MultiClientWait (waitInfo=waitInfo@entry=0x263de98) at executor/multi_client_executor.c:822
#2  0x00007f4a9f74f343 in MultiRealTimeExecute (job=job@entry=0x25d2778) at executor/multi_real_time_executor.c:177
#3  0x00007f4a9f74e4aa in RealTimeExecScan (node=0x2557f30) at executor/multi_executor.c:274
#4  0x000000000063178d in ExecCustomScan (pstate=0x2557f30) at nodeCustom.c:118
#5  0x00000000006239f0 in ExecProcNodeFirst (node=0x2557f30) at execProcnode.c:430
#6  0x0000000000638570 in ExecProcNode (node=0x2557f30) at ../../../src/include/executor/executor.h:250
#7  ExecLimit (pstate=0x2557778) at nodeLimit.c:95
#8  0x00000000006239f0 in ExecProcNodeFirst (node=0x2557778) at execProcnode.c:430
#9  0x000000000061ec76 in ExecProcNode (node=0x2557778) at ../../../src/include/executor/executor.h:250
#10 ExecutePlan (estate=estate@entry=0x2557560, planstate=0x2557778, use_parallel_mode=0 '\000', operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=1 '\001', numberTuples=numberTuples@entry=0, 
    direction=direction@entry=ForwardScanDirection, dest=dest@entry=0x26a7d68, execute_once=execute_once@entry=1 '\001') at execMain.c:1721
#11 0x000000000061fb49 in standard_ExecutorRun (queryDesc=0x259aa80, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at execMain.c:363
#12 0x000000000061fbc5 in ExecutorRun (queryDesc=queryDesc@entry=0x259aa80, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=<optimized out>) at execMain.c:306
#13 0x00000000007626b8 in PortalRunSelect (portal=portal@entry=0x2590a60, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x26a7d68) at pquery.c:932
#14 0x0000000000763cd4 in PortalRun (portal=portal@entry=0x2590a60, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x26a7d68, 
    altdest=altdest@entry=0x26a7d68, completionTag=completionTag@entry=0x7fff09b00f30 "") at pquery.c:773
#15 0x00000000007603e2 in exec_simple_query (query_string=query_string@entry=0x24ecdc0 "SELECT * FROM test_table_1 LIMIT 10;") at postgres.c:1099
#16 0x000000000076200a in PostgresMain (argc=<optimized out>, argv=argv@entry=0x246ee28, dbname=0x246ed40 "ec2-user", username=<optimized out>) at postgres.c:4090
#17 0x00000000006eaad7 in BackendRun (port=port@entry=0x24654d0) at postmaster.c:4357
#18 0x00000000006ec8c5 in BackendStartup (port=port@entry=0x24654d0) at postmaster.c:4029
#19 0x00000000006ecb6e in ServerLoop () at postmaster.c:1753
#20 0x00000000006ede10 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x24228c0) at postmaster.c:1361
#21 0x000000000066372f in main (argc=3, argv=0x24228c0) at main.c:228

Some random notes: the same hang also happens for an INSERT INTO ... SELECT query, with this backtrace:

#0  0x00007fa31a33e9d0 in __poll_nocancel () from /lib64/libc.so.6
#1  0x00007fa3126cfd77 in MultiClientWait (waitInfo=waitInfo@entry=0x21a2a90) at executor/multi_client_executor.c:822
#2  0x00007fa3126d1343 in MultiRealTimeExecute (job=job@entry=0x20cc6f8) at executor/multi_real_time_executor.c:177
#3  0x00007fa3126d04aa in RealTimeExecScan (node=0x21200d0) at executor/multi_executor.c:274
#4  0x000000000063178d in ExecCustomScan (pstate=0x21200d0) at nodeCustom.c:118
#5  0x00000000006239f0 in ExecProcNodeFirst (node=0x21200d0) at execProcnode.c:430
#6  0x0000000000638570 in ExecProcNode (node=0x21200d0) at ../../../src/include/executor/executor.h:250
#7  ExecLimit (pstate=0x211f918) at nodeLimit.c:95
#8  0x00000000006239f0 in ExecProcNodeFirst (node=0x211f918) at execProcnode.c:430
#9  0x000000000061ec76 in ExecProcNode (node=0x211f918) at ../../../src/include/executor/executor.h:250
#10 ExecutePlan (estate=estate@entry=0x211f700, planstate=0x211f918, use_parallel_mode=0 '\000', operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=1 '\001', numberTuples=numberTuples@entry=0,
    direction=direction@entry=ForwardScanDirection, dest=dest@entry=0x2105ba0, execute_once=execute_once@entry=1 '\001') at execMain.c:1721
#11 0x000000000061fb49 in standard_ExecutorRun (queryDesc=0x209ad90, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at execMain.c:363
#12 0x000000000061fbc5 in ExecutorRun (queryDesc=queryDesc@entry=0x209ad90, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=<optimized out>) at execMain.c:306
#13 0x00000000007626b8 in PortalRunSelect (portal=portal@entry=0x20c0f58, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x2105ba0) at pquery.c:932
#14 0x0000000000763cd4 in PortalRun (portal=portal@entry=0x20c0f58, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=0 '\000', run_once=run_once@entry=1 '\001', dest=dest@entry=0x2105ba0,
    altdest=altdest@entry=0x2105ba0, completionTag=completionTag@entry=0x0) at pquery.c:773
#15 0x00007fa3126ce9c6 in ExecuteIntoDestReceiver (query=query@entry=0x2104268, params=params@entry=0x0, dest=dest@entry=0x2105ba0) at executor/insert_select_executor.c:181
#16 0x00007fa3126ceae8 in ExecuteSelectIntoRelation (targetRelationId=targetRelationId@entry=16740, insertTargetList=insertTargetList@entry=0x2105030, selectQuery=selectQuery@entry=0x2104268,
    executorState=executorState@entry=0x2098d80) at executor/insert_select_executor.c:140
#17 0x00007fa3126cebb8 in CoordinatorInsertSelectExecScan (node=0x2098f98) at executor/insert_select_executor.c:75
#18 0x000000000063178d in ExecCustomScan (pstate=0x2098f98) at nodeCustom.c:118
#19 0x00000000006239f0 in ExecProcNodeFirst (node=0x2098f98) at execProcnode.c:430
#20 0x000000000061ec76 in ExecProcNode (node=0x2098f98) at ../../../src/include/executor/executor.h:250
#21 ExecutePlan (estate=estate@entry=0x2098d80, planstate=0x2098f98, use_parallel_mode=0 '\000', operation=operation@entry=CMD_INSERT, sendTuples=sendTuples@entry=0 '\000', numberTuples=numberTuples@entry=0,
    direction=direction@entry=ForwardScanDirection, dest=dest@entry=0x2101348, execute_once=execute_once@entry=1 '\001') at execMain.c:1721
#22 0x000000000061fb49 in standard_ExecutorRun (queryDesc=0x1fad6a0, direction=ForwardScanDirection, count=0, execute_once=execute_once@entry=1 '\001') at execMain.c:363
#23 0x000000000061fbc5 in ExecutorRun (queryDesc=queryDesc@entry=0x1fad6a0, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=execute_once@entry=1 '\001') at execMain.c:306
#24 0x0000000000762d45 in ProcessQuery (plan=plan@entry=0x2101028, sourceText=<optimized out>, params=0x0, queryEnv=0x0, dest=dest@entry=0x2101348, completionTag=completionTag@entry=0x7ffc6ce39e00 "")
    at pquery.c:161
#25 0x0000000000762f74 in PortalRunMulti (portal=portal@entry=0x20c0e40, isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000', dest=dest@entry=0x2101348,
    altdest=altdest@entry=0x2101348, completionTag=completionTag@entry=0x7ffc6ce39e00 "") at pquery.c:1286
#26 0x0000000000763d70 in PortalRun (portal=portal@entry=0x20c0e40, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x2101348,
    altdest=altdest@entry=0x2101348, completionTag=completionTag@entry=0x7ffc6ce39e00 "") at pquery.c:799
#27 0x00000000007603e2 in exec_simple_query (query_string=query_string@entry=0x202e1d0 "INSERT INTO test_table_1 SELECT * FROM test_table_1 LIMIT 10;") at postgres.c:1099
#28 0x000000000076200a in PostgresMain (argc=<optimized out>, argv=argv@entry=0x1fafe28, dbname=0x1fafd40 "ec2-user", username=<optimized out>) at postgres.c:4090
#29 0x00000000006eaad7 in BackendRun (port=port@entry=0x1fa64d0) at postmaster.c:4357
#30 0x00000000006ec8c5 in BackendStartup (port=port@entry=0x1fa64d0) at postmaster.c:4029
#31 0x00000000006ecb6e in ServerLoop () at postmaster.c:1753
#32 0x00000000006ede10 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1f638c0) at postmaster.c:1361
#33 0x000000000066372f in main (argc=3, argv=0x1f638c0) at main.c:228


And we get the following output from pgbench, which never finishes:

[ec2-user@ip-10-192-0-212 ~]$ pgbench -f insert_select.sql -P 3 -c32 -j32 -t 512
starting vacuum...ERROR: relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR: relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR: relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
progress: 3.0 s, 214.7 tps, lat 132.501 ms stddev 246.319
progress: 6.0 s, 224.3 tps, lat 155.069 ms stddev 221.287
progress: 9.0 s, 222.3 tps, lat 135.148 ms stddev 187.363
progress: 12.0 s, 219.0 tps, lat 151.494 ms stddev 205.526
progress: 15.0 s, 222.0 tps, lat 141.065 ms stddev 169.540
progress: 18.0 s, 222.7 tps, lat 137.910 ms stddev 197.463
progress: 21.0 s, 222.6 tps, lat 141.703 ms stddev 220.887
progress: 24.0 s, 224.4 tps, lat 148.118 ms stddev 215.877
progress: 27.0 s, 224.3 tps, lat 138.928 ms stddev 190.332
WARNING: could not establish asynchronous connection after 5000 ms
progress: 30.0 s, 222.4 tps, lat 149.898 ms stddev 269.491
progress: 33.0 s, 220.9 tps, lat 142.246 ms stddev 185.783
progress: 36.0 s, 223.4 tps, lat 147.924 ms stddev 206.710
progress: 39.0 s, 221.0 tps, lat 144.701 ms stddev 210.550
progress: 42.0 s, 221.7 tps, lat 141.649 ms stddev 186.612
progress: 45.0 s, 225.0 tps, lat 140.926 ms stddev 222.316
progress: 48.0 s, 222.6 tps, lat 146.699 ms stddev 198.950
progress: 51.0 s, 225.0 tps, lat 143.601 ms stddev 222.813
progress: 54.0 s, 224.4 tps, lat 136.603 ms stddev 184.196
progress: 57.0 s, 219.7 tps, lat 145.607 ms stddev 232.662
progress: 60.0 s, 226.4 tps, lat 135.318 ms stddev 209.489
progress: 63.0 s, 224.3 tps, lat 137.365 ms stddev 225.198
progress: 66.0 s, 224.3 tps, lat 132.615 ms stddev 156.285
progress: 69.0 s, 222.7 tps, lat 115.979 ms stddev 142.710
progress: 72.0 s, 223.3 tps, lat 71.027 ms stddev 22.280

lisuml commented 6 years ago

Confirmed in our installation. Under heavy load, our ingestor ends up with some data not being ingested. We also see a lot of WARNING: could not establish asynchronous connection after 3000 ms messages when the issue is happening.

metdos commented 6 years ago

Hi @lisuml, which Citus and PostgreSQL version are you using? Are there any other problems?

metdos commented 6 years ago

Note that the -n option suppresses the starting vacuum...ERROR: relation "pgbench_branches" does not exist error messages. I can replicate the problem with our test-automation framework, but not locally.

lisuml commented 6 years ago

The issue happens regularly on our cluster with Citus 7.0.3 and PostgreSQL 10.0 installed. I have just upgraded Citus to 7.1.0 and I'll update the thread with more info soon. It never happened on Citus 5 with PostgreSQL 9.5 with the same volume of queries.

lisuml commented 6 years ago

Ah, and we use Citus Enterprise, btw.

metdos commented 6 years ago

Adding more findings: it seems this query is waiting on the worker node.

From pg_stat_activity:

datid            | 16384
datname          | ec2-user
pid              | 18508
usesysid         | 10
usename          | ec2-user
application_name | citus
client_addr      | 10.192.0.148
client_hostname  |
client_port      | 49685
backend_start    | 2017-11-21 13:23:18.339031+00
xact_start       |
query_start      | 2017-11-21 13:23:18.341751+00
state_change     | 2017-11-21 13:23:18.344189+00
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      |
backend_xmin     |
query            | COPY (SELECT key, occurred_at, value_1, value_2, value_3, value_4 FROM test_table_1_102014 test_table_1 WHERE true LIMIT '10'::bigint) TO STDOUT
backend_type     | client backend
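
For reference, a quick way to list the backends Citus opened on a worker and what they are waiting on; a minimal sketch (the application_name filter assumes the default 'citus' value visible above):

psql -x -c "
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS since_query_start,
       query
FROM pg_stat_activity
WHERE application_name = 'citus'
ORDER BY query_start;"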

And the backtrace of the COPY process:

#0  0x00007f96647a18c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x000000000073d075 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffc95141470, cur_timeout=-1, set=0x17f98d8) at latch.c:1048
#2  WaitEventSetWait (set=0x17f98d8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffc95141470, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663296) at latch.c:1000
#3  0x0000000000653ba1 in secure_read (port=0x183a7b0, ptr=0xd59540 <PqRecvBuffer>, len=8192) at be-secure.c:166
#4  0x000000000065d957 in pq_recvbuf () at pqcomm.c:963
#5  0x000000000065e46f in pq_getbyte () at pqcomm.c:1006
#6  0x000000000075e125 in SocketBackend (inBuf=inBuf@entry=0x7ffc95141630) at postgres.c:328
#7  0x000000000075f9dd in ReadCommand (inBuf=inBuf@entry=0x7ffc95141630) at postgres.c:501
#8  0x000000000076296f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x1844050, dbname=0x1843ed0 "ec2-user", username=<optimized out>) at postgres.c:4030
#9  0x00000000006eb397 in BackendRun (port=port@entry=0x183a7b0) at postmaster.c:4357
#10 0x00000000006ed185 in BackendStartup (port=port@entry=0x183a7b0) at postmaster.c:4029
#11 0x00000000006ed42e in ServerLoop () at postmaster.c:1753
#12 0x00000000006ee6d0 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x17f78b0) at postmaster.c:1361
#13 0x0000000000663eff in main (argc=3, argv=0x17f78b0) at main.c:228
[ec2-user@ip-10-192-0-207 ~]$  strace -p 18508
Process 18508 attached
epoll_wait(3,

metdos commented 6 years ago

Could be related to #1658

metdos commented 6 years ago

Today, I also tried to replicate the problem with local disks, but failed to do so. I spun up a cluster of c3.4xlarge instead of c4.4xlarge. The only notable difference is that c4.4xlarge uses EBS volumes, whereas c3.4xlarge uses local SSDs. This problem could somehow be related to using EBS volumes.

For selects, the throughput was a bit lower on c3.4xlarge, so I also tried c3.8xlarge. I can consistently replicate the problem on c4.4xlarge, but not even once on c3.4xlarge or c3.8xlarge.

Throughput numbers for reference:

pgbench -f select.sql -P 1 -c32 -j32 -t 512 -n
c3.4xlarge: tps = 196
c3.8xlarge: tps = 269
c4.4xlarge: tps = 235 (it rarely does not fail)

metdos commented 6 years ago

I added configurations to test-automation to easily replicate this.

As noted here: https://github.com/citusdata/test-automation#-running-pgbench-tests

The only difference is that you need to use c4.4xlarge instances; as noted above, the problem is not reproducible on c3.4xlarge.

cloudformation/create-stack.sh -k your-keypair -s PgBenchFormation -n 3 -i c4.4xlarge

Then connect to the coordinator and run the pgbench tests with this command:

fab run.pgbench_tests:pgbench_issue_1799.ini

jmehnle commented 6 years ago

After a recent upgrade to Pg 9.5 / Citus 6.2 (from Pg 9.3 / CitusDB 3) we're seeing (1) general unreliability of some real-world heavy distributed queries and (2) chunks of the same WARNING: could not establish asynchronous connection after 5000 ms messages. I'm a little overwhelmed by all the information provided above and am unsure which bits are relevant to our situation, but here's some info on our setup (all on AWS):

1 coordinator on c3.4xlarge, 6 workers on m3.xlarge. Everything is running Pg 9.5 with Citus Enterprise 6.2. We, too, are using the real-time executor for all of our distributed queries. The number of shards involved in most of our queries ranges from the low dozens to the low hundreds. We've aggressively bumped our file descriptor and open-connection limits to accommodate that, so that is not a limiting factor (and everything was working fine on CitusDB 3.0 just a few weeks ago).

This is going on almost all the time, except for periods of low activity:

db-hubble-master-00:~> zcat -f /var/log/postgresql/postgresql-9.5-main.log* | grep 'WARNING:  could not establish asynchronous connection after 5000 ms' | pcregrep -o1 '^([\d-]{10} [\d:]{3})' | sort -nr | uniq -c
   2138 2017-12-25 09:
  13704 2017-12-25 08:
  28477 2017-12-25 07:
  15056 2017-12-25 06:
  18628 2017-12-25 05:
  19394 2017-12-25 04:
   1230 2017-12-25 03:
  13410 2017-12-25 02:
    997 2017-12-25 01:
    194 2017-12-25 00:
    319 2017-12-24 23:
    229 2017-12-24 22:
    123 2017-12-24 08:
    739 2017-12-24 07:
   1872 2017-12-24 06:
     39 2017-12-24 04:
   9468 2017-12-24 03:
   7887 2017-12-24 02:
    678 2017-12-24 01:
   6005 2017-12-24 00:
    141 2017-12-23 23:
     44 2017-12-23 22:
      1 2017-12-23 15:
    189 2017-12-23 09:
   3721 2017-12-23 08:
...

I'm collecting more information on the nature of the queries suffering from this, but meanwhile please let me know what other information I can provide.

sumedhpathak commented 6 years ago

Hi @jmehnle ,

We're treating this issue as high priority, and we hope to have some of it resolved soon. When you mention 'general unreliability', are you seeing queries fail, or do they hang?

metdos commented 6 years ago

Hi @jmehnle, I created a private Slack channel so we can communicate, since some of the information we'll ask for may be a bit sensitive, and it provides a faster way to communicate.

jmehnle commented 6 years ago

I'm talking to you on that private Slack channel now.

To add a bit more information that may be generally relevant to this issue: we are actually seeing queries fail with ERROR: failed to execute task NNN that were not failing under CitusDB 3, albeit with a much lower frequency than WARNING: could not establish asynchronous connection after 5000 ms:

db-hubble-master-00:~> zcat -f /var/log/postgresql/postgresql-9.5-main.log* | grep 'ERROR:  failed to execute task' | pcregrep -o1 '^([\d-]{10} [\d:]{3})' | sort -nr | uniq -c
     16 2017-12-26 04:
     16 2017-12-25 04:
   1792 2017-12-24 04:
      1 2017-12-23 04:
     24 2017-12-22 04:
      6 2017-12-21 09:
     84 2017-12-21 04:
      5 2017-12-20 04:
…

The query plan for one frequently (but not reliably) failing query is:

 HashAggregate  (cost=0.00..0.00 rows=0 width=0) (actual time=275887.337..276904.740 rows=479924 loops=1)
   Group Key: remote_scan.ip
   ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0) (actual time=231376.486..233931.442 rows=10574717 loops=1)
         Task Count: 109
         Tasks Shown: One of 109
         ->  Task
               Node: host=db-hubble-data-00 port=5432 dbname=hubble
               ->  HashAggregate  (cost=10535.27..10536.07 rows=11 width=32) (actual time=3168.415..3168.415 rows=0 loops=1)
                     Group Key: ip
                     ->  Index Scan using index_email_aggregate_on_ts_125795 on email_aggregate_125795 e  (cost=0.43..10374.85 rows=1035 width=32) (actual time=3168.412..3168.412 rows=0 loops=1)
                           Index Cond: ((ts >= '2017-11-28 00:00:00'::timestamp without time zone) AND (ts < '2017-12-19 00:00:00'::timestamp without time zone))
                           Filter: ((policy_result_spf <> 'p'::text) AND (auth_result_dkim <> 'n'::text) AND (hdr_from = ANY ('{facebookmail.com,linkedin.com,twitter.com,aol.com,google.com,gmail.com,hotmail.com,r.groupon.com,icloud.com,…}'))
                           Rows Removed by Filter: 105666
                   Planning time: 47.999 ms
                   Execution time: 3168.901 ms
 Planning time: 82.593 ms
 Execution time: 277121.909 ms

If this sounds unrelated to the original issue reported here, I'll gladly open a new issue. Let me know if that's the case.

pykello commented 6 years ago

I've been working full-time on this for the last 2 days. I haven't found the root cause yet, but here are some observations:

  1. I did some iperf3 tests to make sure the connections between the coordinator and the workers are reliable. Everything seemed fine. I compared iperf3 results between c3.4xlarge and c4.4xlarge, and the bandwidth and number of retries were similar.
  2. After getting stuck, I ran tcpdump on the coordinator and on the worker we were waiting for. From a quick glance, I think there were some packets that were sent from the worker but not received on the coordinator node. I am not sure about this observation yet, but it is something I am going to inspect much more. I did this at the end of the day yesterday but left it for later, since I need to learn tcpdump better before I am sure I'm doing this right. I'll keep you updated.
  3. The last paragraph of Async Command Processing in the libpq docs says:

After sending any command or data on a nonblocking connection, call PQflush. If it returns 1, wait for the socket to become read- or write-ready. If it becomes write-ready, call PQflush again. If it becomes read-ready, call PQconsumeInput, then call PQflush again. Repeat until PQflush returns 0. .. Once PQflush returns 0, wait for the socket to be read-ready and then read the response as described above.

I think for COPY data we are doing something similar, but not exactly this, and after sending a command we are not doing this at all. So I thought this might be the reason for what we are experiencing: if we don't do this, the client might wait for more I/O from the server before it can start execution, and since we are not doing it, it will just wait and execute nothing. I did a quick implementation of this paragraph in SendRemoteCommand(). The problem didn't go away. I will go over my code again to see if I've missed something. I will also go over the libpq docs one more time to see if we have missed similar guidelines. I'll keep you updated.

  4. If you wait long enough, the pgbench command finishes about 15 minutes after getting stuck. When it finishes, we see the Citus warning WARNING: could not consume data from worker node. on the coordinator node, and the libpq LOG: could not receive data from client: Connection timed out. The client is getting the ETIMEDOUT error.

  5. The stack trace in the client when the query is stuck is:

    #1  0x00000000006df2de in WaitEventSetWait ()
    #2  0x000000000061072b in secure_read ()
    #3  0x0000000000618fc8 in pq_recvbuf ()
    #4  0x0000000000619b95 in pq_getbyte ()
    #5  0x00000000006feae2 in PostgresMain ()
    #6  0x00000000004778df in ServerLoop ()
    #7  0x000000000069a139 in PostmasterMain ()
    #8  0x00000000004782fb in main ()
  6. From 4 and 5, the client is actually busy and is waiting for more bytes so it can execute the command, but it is not receiving anything and times out after a long time. I guess this is either because (a) there is a logical problem in the coordinator/worker protocol, which I suspect might be us not using libpq properly, or (b) something network-related fails and a packet is missed.

For 6.a I am going over the libpq docs and comparing them with our code to see what we have missed. For 6.b I am thinking of writing a script (or a SQL query) to process tcpdump's output to see if all communication between the coordinator and the worker is OK.

I'll keep you updated.

pykello commented 6 years ago

Also, when I add some fprintf(stderr, ...) calls to the code (because I want to avoid using elog(), which does more than just logging and might check for interrupts, etc.), the problem becomes less frequent and pgbench sometimes finishes. If the reason for the issue is packet drops due to congestion, then the frequent fprintf() calls might be slowing things down a little and reducing congestion. I need to look into this too.

No conclusion, just thought I might share this in case someone has an idea.

pykello commented 6 years ago

Disregard item 6 two posts earlier. The client is actually waiting for the first byte of the protocol by calling pq_getbyte() to read the next command.

pykello commented 6 years ago

Some status updates :)

Summary

TL;DR: from what I understand so far, this happens when a TCP error occurs on the connection. Some data then sits in the socket's OS queues waiting to be acked or sent, but it is never acked or sent. From libpq's perspective everything is fine and no errors have been reported to it. When I do getsockopt(..., SO_ERROR, ...), it doesn't return any error.

Clarification about the problem

When we are stuck,

Difference between c3.4xlarge and c4.4xlarge

So the problem happens on c4.4xlarge, but not on c3.4xlarge instance types. One of the differences is that some kinds of network errors are more frequent on c4.4xlarge.

While running pgbench, the TCP error activity report on c4.4xlarge (output of sar -n ETCP 20 4) is:

02:14:44 PM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
02:15:04 PM      0.00      0.05    144.62      0.00      0.40
02:15:24 PM      0.00      0.00    134.23      0.05      0.36
02:15:44 PM      0.00      0.00    100.25      0.00      0.66
02:16:04 PM      0.00      0.10      6.19      0.00      0.20
Average:         0.00      0.04     96.24      0.01      0.40

The same in c3.4xlarge is:

[ec2-user@ip-10-192-0-182 ~]$ sar -n ETCP 20 4
Linux 4.9.62-21.56.amzn1.x86_64 (ip-10-192-0-182)     01/08/2018     _x86_64_    (16 CPU)

06:27:50 PM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
06:28:10 PM      0.00      0.05    141.15      0.00      0.05
06:28:30 PM      0.00      0.00    151.36      0.00      0.00
06:28:50 PM      0.00      0.00    144.55      0.00      0.00
06:29:10 PM      0.00      0.00    163.36      0.00      0.00
Average:         0.00      0.01    150.11      0.00      0.01

As you can see, estres/s is zero on c3.4xlarge but non-zero on c4.4xlarge, and orsts/s is much higher on c4.4xlarge. Their definitions according to man sar: estres/s is the number of times per second TCP connections made a direct transition to the CLOSED state from either ESTABLISHED or CLOSE-WAIT (tcpEstabResets), and orsts/s is the number of TCP segments sent per second containing the RST flag (outRsts).

Difference between stuck and successful tasks

I queried socket status for both stuck and non-stuck tasks using the ss command.
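
For anyone reproducing this, a sketch of the kind of ss invocation used for this inspection (assumes the default PostgreSQL port 5432; -i prints internal TCP state, -o the timers, -p the owning process):

# run on both the coordinator and the worker
ss -tinop state established '( sport = :5432 or dport = :5432 )'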

Stuck task, coordinator:

tcp    ESTAB      0      146    10.192.0.244:37742                10.192.0.88:postgres              users:
(("postgres",pid=13893,fd=42)) timer:(on,1min51sec,11) uid:500 ino:95238965 sk:4be2 <->
skmem:(r0,rb233592,t0,tb332800,f1792,w2304,o0,bl0) ts sack cubic wscale:7,7 rto:120000 
backoff:11 rtt:0.411/0.413 ato:40 mss:8949 cwnd:1 ssthresh:7 bytes_acked:94 bytes_received:323 
segs_out:18 segs_in:4 send 174.2Mbps lastsnd:8380 lastrcv:343376 lastack:343376 pacing_rate 
2089.0Mbps unacked:1 retrans:1/12 lost:1 rcv_space:26847

Worker:

tcp    ESTAB      0      2511   10.192.0.88:postgres             10.192.0.244:37742                 users:
(("postgres",pid=94519,fd=9)) timer:(on,25sec,10) uid:500 ino:59128786 sk:54 <->
         skmem:(r0,rb233880,t0,tb332800,f801,w3295,o0,bl0) ts sack cubic wscale:7,7 rto:120000 
backoff:10 rtt:0.214/0.09 ato:40 mss:8949 cwnd:1 ssthresh:7 bytes_acked:323 bytes_received:239 
segs_out:26 segs_in:17 send 334.5Mbps lastsnd:94168 lastrcv:308128 lastack:93976 pacing_rate 
667.1Mbps unacked:1 retrans:1/11 lost:1 rcv_space:26847

Successful task, coordinator:

tcp    ESTAB      0      0      10.192.0.244:37686                10.192.0.88:postgres              users:
(("postgres",pid=13893,fd=15)) timer:(keepalive,114min,0) uid:500 ino:95238938 sk:4bda <->
         skmem:(r0,rb233592,t0,tb332800,f0,w0,o0,bl0) ts sack cubic wscale:7,7 rto:204 rtt:0.576/0.742 
ato:40 mss:8949 cwnd:10 ssthresh:7 bytes_acked:240 bytes_received:2845 segs_out:7 segs_in:5 
send 1242.9Mbps lastsnd:343380 lastrcv:343376 lastack:343376 pacing_rate 1489.9Mbps rcv_rtt:4 
rcv_space:26847

Worker:

(("postgres",pid=94510,fd=9)) timer:(keepalive,114min,0) uid:500 ino:59128777 sk:56 <->
         skmem:(r0,rb233880,t0,tb332800,f0,w0,o0,bl0) ts sack cubic wscale:7,7 rto:208 rtt:5.506/10.693 
ato:40 mss:8949 cwnd:10 ssthresh:7 bytes_acked:2845 bytes_received:239 segs_out:4 segs_in:7 
send 130.0Mbps lastsnd:308132 lastrcv:308136 lastack:308088 pacing_rate 156.0Mbps 
rcv_space:26847

As you can see, when the task is stuck, the sockets on both ends have their retransmission timer armed with a large backoff (timer:(on,...), rto:120000), non-zero unacked/retrans/lost counters, and bytes waiting in the send queue, whereas the sockets of the successful task sit idle on keepalive timers with nothing queued.

pg_conn's internal state

For completeness, I also dumped some of pg_conn's internal state variables after the task got stuck:

[16647] nEvents: 0, status: 0, asyncStatus: 1, xactStatus: 0
last_query: COPY (SELECT key, occurred_at, value_1, value_2, value_3, value_4 FROM test_table_102037 test_table WHERE true LIMIT '10'::bigint) TO STDOUT
last_sqlstate: , non-blocking: 1, singleRowMode: 0
errorMessage: 
workBuffer: TimeZone
inBufSize: 16384, inStart: 0, inCursor: 0, inEnd: 0
outBufSize: 16384, outCount: 0, outMsgStart: 1, outMsgEnd: 146
rowBufLen: 32, result: 0, next_result: 0

I didn't see anything interesting there.

pykello commented 6 years ago

After spending some time understanding the output of ss better, by reading its source code and matching it with the relevant parts of the Linux kernel source, I have more information.

If we run ss multiple times, we see that the retrans value increases over time (it is formatted as retransmits_in_flight/total_retransmits). This is because the coordinator sends a packet, waits rto milliseconds (see it in the ss output) for an ack, and doesn't receive one. It then increases the backoff value by 1; if that is still less than the net.ipv4.tcp_retries2 config value, it doubles the rto (capping it at 120 seconds, i.e. next_rto := min(rto * 2, 120000)) and tries retransmitting again. The initial value of rto is calculated as a function of rtt (round-trip time) and is around 200 ms on these ec2 instances.

This is consistent with what I was seeing. The value of net.ipv4.tcp_retries2 on the ec2 instance is 15. By my calculation, 15 retries take around 942 seconds, and if we wait for pgbench, it finishes with a timeout error in 15-16 minutes.
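
For illustration, a back-of-the-envelope version of that calculation; a sketch that assumes the backoff starts from the 200 ms minimum RTO and is capped at the 120 s TCP_RTO_MAX (the exact figure depends on the rto actually measured on the connection, which is why the observed time is a bit longer):

# 16 timeout periods: one after the initial send and one after each of the 15 retransmits
awk 'BEGIN {
  rto = 0.2; total = 0
  for (i = 0; i <= 15; i++) { total += (rto < 120 ? rto : 120); rto *= 2 }
  printf "~%.1f seconds\n", total   # prints ~924.6 seconds, i.e. roughly 15 minutes
}'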

So, my conclusion is that this is a network issue and not a PostgreSQL or Citus problem. We should get faster timeouts by decreasing net.ipv4.tcp_retries2. I haven't tried this yet; I will try it tomorrow or the day after and send an update.
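
A sketch of how that could be lowered on the coordinator and workers; the value and the file name are illustrative assumptions, not a vetted recommendation:

sudo sysctl -w net.ipv4.tcp_retries2=10                          # apply immediately
echo 'net.ipv4.tcp_retries2 = 10' | sudo tee /etc/sysctl.d/90-tcp-retries.conf
sudo sysctl --system                                             # persist across reboots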

Why do the retransmits fail? I'm not sure, since I don't have much (any?) networking experience. But it is an OS or hardware problem (or the EC2 virtualization layer?), not a database problem. I can do more research later.

The numerous WARNING: could not establish asynchronous connection after 5000 ms warnings are a different problem from this issue; maybe I can spend some time understanding them and see if we can do something about them.

pykello commented 6 years ago

OK, I tried reducing net.ipv4.tcp_retries2 from 15 to 10. As expected, pgbench now times out about 10 minutes earlier than before: 6 minutes instead of 16. If I reduce it to 7, pgbench times out after around 2 minutes.

So, I think we have a good understanding of this issue now. The network connection on c4.4xlarge has a higher error rate than on c3.4xlarge, as seen in the output of sar -n ETCP 20 4. This sometimes results in lost TCP packets that cannot be recovered even after many retransmits. The TCP layer only reports the problem to libpq after it has been retransmitting for 15-16 minutes (this can be shortened by decreasing the number of retransmits via net.ipv4.tcp_retries2). Once libpq sees the problem, we get warnings like WARNING: could not consume data from worker node, as expected.

So I think we are almost done with this issue, unless someone has more questions or concerns.

marcocitus commented 6 years ago

Should we report this on the EC2 forum?

onderkalaci commented 6 years ago

OK, I tried reducing net.ipv4.tcp_retries2 from 15 to 10

The open question I have is the following: are we going to instruct our users to do this? And what is the trade-off if we decrease it too much?

pykello commented 6 years ago

On Thu, Jan 11, 2018 at 7:46 AM, Marco Slot notifications@github.com wrote:

Should we report this on the EC2 forum?

Yes, I think we should do that. Currently I am looking at the packets captured when this issue happened, to see if I can understand it better. I'll post it on the EC2 forums after that, maybe tomorrow.

pykello commented 6 years ago

Here is the action plan for this issue at this point:

  1. Post this issue on EC2 forums.
  2. Test this in Citus Cloud to find out which plans are vulnerable.
  3. Do some research on net.ipv4.tcp_retries2 to understand what value we should recommend to our users. From a brief search, RFC 1122 suggests the retransmission timeout should be at least 100 seconds, which means net.ipv4.tcp_retries2 should be at least 8. On the other hand, IBM DB2's docs suggest a value of 3. (A rough comparison of the resulting timeouts is sketched right after this list.)
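
For reference, a rough comparison of the hypothetical lower-bound timeouts for a few candidate values, using the same 200 ms / 120 s assumptions as the earlier sketch (actual timeouts are longer because the RTO adapts to the measured round-trip time):

for retries in 3 7 8 10 15; do
  awk -v r="$retries" 'BEGIN {
    rto = 0.2; total = 0
    for (i = 0; i <= r; i++) { total += (rto < 120 ? rto : 120); rto *= 2 }
    printf "tcp_retries2 = %2d  ->  ~%.1f s\n", r, total   # e.g. 3 -> 3.0, 8 -> 102.2, 15 -> 924.6
  }'
done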

If anyone has more ideas, please post here.

pykello commented 6 years ago

Just for documentation:

  1. netstat -i on the coordinator and workers doesn't show any packet drops.
  2. There used to be a Xen bug which caused dropped packets, which is explained a bit here. The workaround in the post didn't solve our problem, so I think our failure isn't related to what is explained there.
  3. Disabling net.ipv4.tcp_tw_reuse and widening net.ipv4.ip_local_port_range made the problem go away, but also made the tps much lower (100 transactions per second). So I think the problem went away because of the lower load, and this is not a proper fix. See linux/Documentation/networking/ip-sysctl.txt for an explanation of these config variables (the snippet below shows how to inspect them). I tried this because the problem happened after some ports were reused; after some research, I think that is unrelated.
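
For reference, the kernel settings touched in these experiments can be inspected read-only like this (a sketch):

sysctl net.ipv4.tcp_tw_reuse net.ipv4.ip_local_port_range net.ipv4.tcp_retries2
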
pykello commented 6 years ago

Posted this on EC2 forums: https://forums.aws.amazon.com/thread.jspa?messageID=825819

fdr commented 6 years ago

Encountered this problem, FYI. It had ramifications.

onderkalaci commented 4 years ago

We've removed the real-time executor from the code base. The solution here is to upgrade to Citus 8.3+.

marcocitus commented 4 years ago

IIRC this came down to TCP connections getting black-holed on AWS, so it could still occur with the new executor if that issue still exists, but it does not seem like something we could really fix.