onderkalaci closed this issue 4 years ago.
Confirmed in our installation. Under heavy load, our ingestor ends up with some data not being ingested. We also see a lot of WARNING: could not establish asynchronous connection after 3000 ms when the issue is happening.
Hi @lisuml, which Citus and PostgreSQL version are you using? Are there any other problems?
Note that the -n option removes these error messages: starting vacuum...ERROR: relation "pgbench_branches" does not exist. I can replicate the problem with our test-automation framework, but not locally.
The issue happens regularly on our cluster with Citus 7.0.3 and PostgreSQL 10.0 installed. I have just upgraded Citus to 7.1.0 and I'll update the thread with more info soon. It never happened on Citus 5 with PostgreSQL 9.5 with the same volume of queries.
Ah, and we use Citus Enterprise, btw.
Adding more findings: it seems that this query is waiting on the worker node.
From pg_stat_activity:
datid | 16384
datname | ec2-user
pid | 18508
usesysid | 10
usename | ec2-user
application_name | citus
client_addr | 10.192.0.148
client_hostname |
client_port | 49685
backend_start | 2017-11-21 13:23:18.339031+00
xact_start |
query_start | 2017-11-21 13:23:18.341751+00
state_change | 2017-11-21 13:23:18.344189+00
wait_event_type | Client
wait_event | ClientRead
state | idle
backend_xid |
backend_xmin |
query | COPY (SELECT key, occurred_at, value_1, value_2, value_3, value_4 FROM test_table_1_102014 test_table_1 WHERE true LIMIT '10'::bigint) TO STDOUT
backend_type | client backend
And the backtrace of the COPY process:
#0 0x00007f96647a18c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x000000000073d075 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffc95141470, cur_timeout=-1, set=0x17f98d8) at latch.c:1048
#2 WaitEventSetWait (set=0x17f98d8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffc95141470, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663296) at latch.c:1000
#3 0x0000000000653ba1 in secure_read (port=0x183a7b0, ptr=0xd59540 <PqRecvBuffer>, len=8192) at be-secure.c:166
#4 0x000000000065d957 in pq_recvbuf () at pqcomm.c:963
#5 0x000000000065e46f in pq_getbyte () at pqcomm.c:1006
#6 0x000000000075e125 in SocketBackend (inBuf=inBuf@entry=0x7ffc95141630) at postgres.c:328
#7 0x000000000075f9dd in ReadCommand (inBuf=inBuf@entry=0x7ffc95141630) at postgres.c:501
#8 0x000000000076296f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x1844050, dbname=0x1843ed0 "ec2-user", username=<optimized out>) at postgres.c:4030
#9 0x00000000006eb397 in BackendRun (port=port@entry=0x183a7b0) at postmaster.c:4357
#10 0x00000000006ed185 in BackendStartup (port=port@entry=0x183a7b0) at postmaster.c:4029
#11 0x00000000006ed42e in ServerLoop () at postmaster.c:1753
#12 0x00000000006ee6d0 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x17f78b0) at postmaster.c:1361
#13 0x0000000000663eff in main (argc=3, argv=0x17f78b0) at main.c:228
[ec2-user@ip-10-192-0-207 ~]$ strace -p 18508
Process 18508 attached
epoll_wait(3,
Could be related to #1658
Today I also tried to replicate the problem with local disks, but failed to do so. I spun up a cluster of c3.4xlarge instances instead of c4.4xlarge. The only notable difference is that c4.4xlarge uses EBS volumes whereas c3.4xlarge uses local SSDs, so this problem could somehow be related to using EBS volumes.
For selects, the throughput was a bit lower on c3.4xlarge, so I also tried c3.8xlarge. I can consistently replicate the problem with c4.4xlarge, but not even once with c3.4xlarge or c3.8xlarge.
Throughput numbers for reference:
pgbench -f select.sql -P 1 -c32 -j32 -t 512 -n
c3.4xlarge: tps = 196
c3.8xlarge: tps = 269
c4.4xlarge: tps = 235 (rarely it does not fail)
I added configurations to test-automation to make this easy to replicate, as noted here: https://github.com/citusdata/test-automation#-running-pgbench-tests
The only difference is that you need to use c4.4xlarge instances; as noted above, the problem is not reproducible on c3.4xlarge.
cloudformation/create-stack.sh -k your-keypair -s PgBenchFormation -n 3 -i c4.4xlarge
Then connect to the coordinator and run the pgbench tests with this command:
fab run.pgbench_tests:pgbench_issue_1799.ini
After a recent upgrade to Pg 9.5 / Citus 6.2 (from Pg 9.3 / CitusDB 3) we're seeing (1) general unreliability of some real-world heavy distributed queries and (2) chunks of the same WARNING: could not establish asynchronous connection after 5000 ms messages. I'm a little overwhelmed by all the bits of information provided above and am unsure which bits are relevant to our situation, but here's some info on our setup (all on AWS):
1 coordinator on c3.4xlarge, 6 workers on m3.xlarge. Everything is running Pg 9.5 with Citus Enterprise 6.2. We, too, are using the real-time executor for all of our distributed queries. The number of shards involved in most of our queries ranges from the low dozens to the low hundreds. We've aggressively bumped our file descriptor and open connection limits to accommodate this, so that is not a limiting factor (and it was working fine on CitusDB 3.0 just a few weeks ago).
This is going on almost all the time, except for periods of low activity:
db-hubble-master-00:~> zcat -f /var/log/postgresql/postgresql-9.5-main.log* | grep 'WARNING: could not establish asynchronous connection after 5000 ms' | pcregrep -o1 '^([\d-]{10} [\d:]{3})' | sort -nr | uniq -c
2138 2017-12-25 09:
13704 2017-12-25 08:
28477 2017-12-25 07:
15056 2017-12-25 06:
18628 2017-12-25 05:
19394 2017-12-25 04:
1230 2017-12-25 03:
13410 2017-12-25 02:
997 2017-12-25 01:
194 2017-12-25 00:
319 2017-12-24 23:
229 2017-12-24 22:
123 2017-12-24 08:
739 2017-12-24 07:
1872 2017-12-24 06:
39 2017-12-24 04:
9468 2017-12-24 03:
7887 2017-12-24 02:
678 2017-12-24 01:
6005 2017-12-24 00:
141 2017-12-23 23:
44 2017-12-23 22:
1 2017-12-23 15:
189 2017-12-23 09:
3721 2017-12-23 08:
...
I'm collecting more information on the nature of the queries suffering from this, but meanwhile please let me know what other information I can provide.
Hi @jmehnle ,
We're treating this issue as high priority and hope to have some of this resolved soon. When you mention 'general unreliability', are you seeing queries fail, or do they hang?
Hi @jmehnle, I created a private Slack channel to communicate given that some of the information we ask would be a bit sensitive, and it provides a faster way of communication.
I'm talking to you on that private Slack channel now.
To add a bit more information that may be generally relevant to this issue: we are actually seeing queries fail with ERROR: failed to execute task NNN that were not failing under CitusDB 3, albeit at a much lower frequency than the WARNING: could not establish asynchronous connection after 5000 ms messages:
db-hubble-master-00:~> zcat -f /var/log/postgresql/postgresql-9.5-main.log* | grep 'ERROR: failed to execute task' | pcregrep -o1 '^([\d-]{10} [\d:]{3})' | sort -nr | uniq -c
16 2017-12-26 04:
16 2017-12-25 04:
1792 2017-12-24 04:
1 2017-12-23 04:
24 2017-12-22 04:
6 2017-12-21 09:
84 2017-12-21 04:
5 2017-12-20 04:
…
The query plan for one frequently (but not reliably) failing query is:
HashAggregate (cost=0.00..0.00 rows=0 width=0) (actual time=275887.337..276904.740 rows=479924 loops=1)
Group Key: remote_scan.ip
-> Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0) (actual time=231376.486..233931.442 rows=10574717 loops=1)
Task Count: 109
Tasks Shown: One of 109
-> Task
Node: host=db-hubble-data-00 port=5432 dbname=hubble
-> HashAggregate (cost=10535.27..10536.07 rows=11 width=32) (actual time=3168.415..3168.415 rows=0 loops=1)
Group Key: ip
-> Index Scan using index_email_aggregate_on_ts_125795 on email_aggregate_125795 e (cost=0.43..10374.85 rows=1035 width=32) (actual time=3168.412..3168.412 rows=0 loops=1)
Index Cond: ((ts >= '2017-11-28 00:00:00'::timestamp without time zone) AND (ts < '2017-12-19 00:00:00'::timestamp without time zone))
Filter: ((policy_result_spf <> 'p'::text) AND (auth_result_dkim <> 'n'::text) AND (hdr_from = ANY ('{facebookmail.com,linkedin.com,twitter.com,aol.com,google.com,gmail.com,hotmail.com,r.groupon.com,icloud.com,…}'))
Rows Removed by Filter: 105666
Planning time: 47.999 ms
Execution time: 3168.901 ms
Planning time: 82.593 ms
Execution time: 277121.909 ms
If this sounds unrelated to the original issue reported here, I'll gladly open a new issue. Let me know if that's the case.
I've been working full-time on this for the last 2 days. I haven't found the root of the issue yet, but here are some observations:
- iperf3 tests to make sure the connection between the coordinator and workers is reliable: everything seemed fine. I compared iperf3 results between c3.4xlarge and c4.4xlarge, and the bandwidth and number of retries were similar.
- tcpdump on the coordinator and on the worker we were waiting for: from a quick glance, I think there were some packets that were sent from the worker but not received on the coordinator node. I am not sure about my observation yet, but this is something I am going to inspect much more. I did this at the end of the day yesterday, but left it for later since I need to learn tcpdump before I am sure I'm doing this right. I'll keep you updated.
- The libpq docs on nonblocking connections say: "After sending any command or data on a nonblocking connection, call PQflush. If it returns 1, wait for the socket to become read- or write-ready. If it becomes write-ready, call PQflush again. If it becomes read-ready, call PQconsumeInput, then call PQflush again. Repeat until PQflush returns 0. ... Once PQflush returns 0, wait for the socket to be read-ready and then read the response as described above."
I think for COPY data we are doing something similar, but not exactly this; after sending a command, we are not doing this at all. So I thought this might be the reason for what we are experiencing, i.e. if we don't do this, the client might wait for some more I/O from the server before it can start execution, but since we are not doing it, it will just wait and execute nothing. I did a quick implementation of this paragraph in SendRemoteCommand(). The problem didn't go away. I will go over my code again to see if I've missed something, and I will also go over the libpq docs one more time to see if we have missed similar guidelines. I'll keep you updated.
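For reference, here is a minimal sketch of the send-side loop that paragraph describes, assuming a PGconn already put into nonblocking mode with PQsetnonblocking(conn, 1) and a command just submitted with PQsendQuery(); this is just my reading of the libpq docs, not the actual Citus code:

```c
/* Sketch of the libpq nonblocking-send loop quoted above.
 * Assumes conn is a valid PGconn in nonblocking mode and that a
 * command was just sent with e.g. PQsendQuery(). */
#include <poll.h>
#include <libpq-fe.h>

static int
FlushUntilDone(PGconn *conn)
{
	int flushResult = PQflush(conn);

	while (flushResult == 1)
	{
		/* wait until the socket becomes read- or write-ready */
		struct pollfd pfd = { PQsocket(conn), POLLIN | POLLOUT, 0 };

		if (poll(&pfd, 1, -1) < 0)
		{
			return -1;
		}

		if (pfd.revents & POLLIN)
		{
			/* read-ready: absorb incoming data before flushing again */
			if (!PQconsumeInput(conn))
			{
				return -1;
			}
		}

		/* write-ready (or after consuming input): try to flush again */
		flushResult = PQflush(conn);
	}

	/* 0 means all queued data was sent; -1 means failure */
	return flushResult;
}
```

This is essentially the pattern I tried to bolt onto SendRemoteCommand() in the experiment above.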
If you wait long enough, the pgbench command will finish about 15 minutes after getting stuck. When it finishes, we see the Citus warning WARNING: could not consume data from worker node. on the coordinator node, and the libpq log LOG: could not receive data from client: Connection timed out. The client is getting the ETIMEDOUT error.
The stack trace in the client when the query is stuck is:
#1 0x00000000006df2de in WaitEventSetWait ()
#2 0x000000000061072b in secure_read ()
#3 0x0000000000618fc8 in pq_recvbuf ()
#4 0x0000000000619b95 in pq_getbyte ()
#5 0x00000000006feae2 in PostgresMain ()
#6 0x00000000004778df in ServerLoop ()
#7 0x000000000069a139 in PostmasterMain ()
#8 0x00000000004782fb in main ()
From 4 and 5, the client is actually busy and is waiting for more bytes so it can execute the command, but it is not receiving anything and times out after a long time. I guess this is either because (a) there is a logical problem in the coordinator/worker communication, which I suspect might be that we're not using libpq properly, or (b) something network-related fails and a packet is missed.
For 6.a I am going over the libpq docs and comparing them with our code to see what we have missed. For 6.b I am thinking of writing a script (or a SQL query) to process tcpdump's output to see if all communication between the coordinator and worker is OK.
I'll keep you updated.
Also, when I add some fprintf(stderr, ...) calls to the code (because I want to avoid using elog(), which does more than just logging and might check for interrupts, etc.), the problem becomes less frequent and pgbench sometimes finishes. If the reason for the issue is packet drop because of congestion, then the frequent fprintf() calls might be helping to slow things down a little and reduce congestion. I need to look into this too.
No conclusion, just thought I'd share this in case someone has an idea.
Don't mind the 6 in two posts earlier. The client is actually waiting for the first byte of the protocol by calling pq_getbyte() to read the next command.
Some status updates :)
TL;DR: from what I understand so far, this happens when a TCP error occurs on the connection. Some data then sits in the socket's OS queues waiting to be acked or sent, but it is never acked or sent. From libpq's perspective, everything is fine and no errors have been reported to it. When I do getsockopt(..., SO_ERROR, ...), it doesn't return any error.
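For the record, the check I'm doing is roughly this (a sketch; conn here is the stuck connection):

```c
/* Sketch: check the pending socket error on a stuck libpq connection.
 * Returns the pending errno, or 0 when the OS reports no error at all. */
#include <sys/socket.h>
#include <libpq-fe.h>

static int
PendingSocketError(PGconn *conn)
{
	int soError = 0;
	socklen_t len = sizeof(soError);

	if (getsockopt(PQsocket(conn), SOL_SOCKET, SO_ERROR,
				   &soError, &len) < 0)
	{
		return -1;				/* getsockopt itself failed */
	}

	return soError;				/* 0 means "no error pending" */
}
```

It returns 0 here, which matches libpq not seeing any error either.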
When we are stuck:
- On the coordinator side, we are in the EXEC_COMPUTE_TASK_RUNNING state, but the connection is always in a busy state and results are never ready, since libpq always returns PQisBusy() as true (see the sketch below this list). We are not stuck in MultiClientWait(); the reason our backtraces show MultiClientWait() as the problem is that we have a short sleep there, so when we grab a backtrace the chance of execution being there is much higher than at other points in the code.
- On the worker side, libpq seems to think it is done with the query, has sent the response, and is waiting for the next query to arrive. It is in the idle state.
So the problem happens on c4.4xlarge but not on c3.4xlarge instance types. One of their differences is that some kinds of network errors are more frequent on c4.4xlarge.
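To make the "busy" state above concrete, this is roughly the generic libpq pattern for waiting on a result asynchronously; in the stuck case the data never arrives, so PQisBusy() never returns false (again a generic sketch, not the Citus executor code):

```c
/* Sketch of the generic libpq "wait for a result" pattern.
 * In the stuck case, poll() never reports readable data (or
 * PQconsumeInput() finds nothing new), PQisBusy() stays true,
 * and PQgetResult() is never reached. */
#include <poll.h>
#include <libpq-fe.h>

static PGresult *
WaitForResult(PGconn *conn)
{
	while (PQisBusy(conn))
	{
		struct pollfd pfd = { PQsocket(conn), POLLIN, 0 };

		/* block until the server sends something (no timeout here) */
		if (poll(&pfd, 1, -1) < 0)
		{
			return NULL;
		}

		/* move whatever arrived from the socket into libpq's buffers */
		if (!PQconsumeInput(conn))
		{
			return NULL;
		}
	}

	/* not busy anymore: a complete result is buffered and ready */
	return PQgetResult(conn);
}
```

The executor's MultiClientWait() is essentially the poll step of this loop with a short timeout, which is why the backtraces tend to land there.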
While running pgbench, the TCP error activity report on c4.4xlarge is:
02:14:44 PM atmptf/s estres/s retrans/s isegerr/s orsts/s
02:15:04 PM 0.00 0.05 144.62 0.00 0.40
02:15:24 PM 0.00 0.00 134.23 0.05 0.36
02:15:44 PM 0.00 0.00 100.25 0.00 0.66
02:16:04 PM 0.00 0.10 6.19 0.00 0.20
Average: 0.00 0.04 96.24 0.01 0.40
The same in c3.4xlarge is:
[ec2-user@ip-10-192-0-182 ~]$ sar -n ETCP 20 4
Linux 4.9.62-21.56.amzn1.x86_64 (ip-10-192-0-182) 01/08/2018 _x86_64_ (16 CPU)
06:27:50 PM atmptf/s estres/s retrans/s isegerr/s orsts/s
06:28:10 PM 0.00 0.05 141.15 0.00 0.05
06:28:30 PM 0.00 0.00 151.36 0.00 0.00
06:28:50 PM 0.00 0.00 144.55 0.00 0.00
06:29:10 PM 0.00 0.00 163.36 0.00 0.00
Average: 0.00 0.01 150.11 0.00 0.01
As you can see, estres/s is close to zero on c3.4xlarge but higher on c4.4xlarge. Also, orsts/s is much higher on c4.4xlarge. Their definitions, according to man sar, are roughly: estres/s is the number of times per second that TCP connections made a direct transition to the CLOSED state from either ESTABLISHED or CLOSE-WAIT (tcpEstabResets), and orsts/s is the number of TCP segments sent per second containing the RST flag (tcpOutRsts).
I queried socket status for both stuck and non-stuck tasks using the ss command.
Stuck task, coordinator:
tcp ESTAB 0 146 10.192.0.244:37742 10.192.0.88:postgres users:
(("postgres",pid=13893,fd=42)) timer:(on,1min51sec,11) uid:500 ino:95238965 sk:4be2 <->
skmem:(r0,rb233592,t0,tb332800,f1792,w2304,o0,bl0) ts sack cubic wscale:7,7 rto:120000
backoff:11 rtt:0.411/0.413 ato:40 mss:8949 cwnd:1 ssthresh:7 bytes_acked:94 bytes_received:323
segs_out:18 segs_in:4 send 174.2Mbps lastsnd:8380 lastrcv:343376 lastack:343376 pacing_rate
2089.0Mbps unacked:1 retrans:1/12 lost:1 rcv_space:26847
Worker:
tcp ESTAB 0 2511 10.192.0.88:postgres 10.192.0.244:37742 users:
(("postgres",pid=94519,fd=9)) timer:(on,25sec,10) uid:500 ino:59128786 sk:54 <->
skmem:(r0,rb233880,t0,tb332800,f801,w3295,o0,bl0) ts sack cubic wscale:7,7 rto:120000
backoff:10 rtt:0.214/0.09 ato:40 mss:8949 cwnd:1 ssthresh:7 bytes_acked:323 bytes_received:239
segs_out:26 segs_in:17 send 334.5Mbps lastsnd:94168 lastrcv:308128 lastack:93976 pacing_rate
667.1Mbps unacked:1 retrans:1/11 lost:1 rcv_space:26847
Successful task, coordinator:
tcp ESTAB 0 0 10.192.0.244:37686 10.192.0.88:postgres users:
(("postgres",pid=13893,fd=15)) timer:(keepalive,114min,0) uid:500 ino:95238938 sk:4bda <->
skmem:(r0,rb233592,t0,tb332800,f0,w0,o0,bl0) ts sack cubic wscale:7,7 rto:204 rtt:0.576/0.742
ato:40 mss:8949 cwnd:10 ssthresh:7 bytes_acked:240 bytes_received:2845 segs_out:7 segs_in:5
send 1242.9Mbps lastsnd:343380 lastrcv:343376 lastack:343376 pacing_rate 1489.9Mbps rcv_rtt:4
rcv_space:26847
Worker:
(("postgres",pid=94510,fd=9)) timer:(keepalive,114min,0) uid:500 ino:59128777 sk:56 <->
skmem:(r0,rb233880,t0,tb332800,f0,w0,o0,bl0) ts sack cubic wscale:7,7 rto:208 rtt:5.506/10.693
ato:40 mss:8949 cwnd:10 ssthresh:7 bytes_acked:2845 bytes_received:239 segs_out:4 segs_in:7
send 130.0Mbps lastsnd:308132 lastrcv:308136 lastack:308088 pacing_rate 156.0Mbps
rcv_space:26847
As you can see, when the task is stuck, we have the following:
- unacked packets.
- wmem_queued (marked as skmem->w) is non-zero. This is the amount of memory used by the socket send buffer, queued in the transmit queue and either not yet sent out or not yet acknowledged.
For completeness, I also dumped some of pg_conn's internal state variables after the task got stuck:
[16647] nEvents: 0, status: 0, asyncStatus: 1, xactStatus: 0
last_query: COPY (SELECT key, occurred_at, value_1, value_2, value_3, value_4 FROM test_table_102037 test_table WHERE true LIMIT '10'::bigint) TO STDOUT
last_sqlstate: , non-blocking: 1, singleRowMode: 0
errorMessage:
workBuffer: TimeZone
inBufSize: 16384, inStart: 0, inCursor: 0, inEnd: 0
outBufSize: 16384, outCount: 0, outMsgStart: 1, outMsgEnd: 146
rowBufLen: 32, result: 0, next_result: 0
I didn't see anything interesting there.
After spending some time understanding the output of ss better by reading its source code and matching it with the relevant parts of the Linux kernel source code, I have more information.
If we run ss multiple times, we will see that the retrans value increases over time (it is formatted as retransmits_in_flight/total_retransmits). This is because the coordinator sends a packet and waits rto milliseconds (shown in the ss output) for the ack, but doesn't receive it. Then it increases the backoff value by 1. If that is less than the net.ipv4.tcp_retries2 config value, it doubles the rto (maxing out at 120 seconds, i.e. next_rto := min(rto * 2, 120000)) and tries retransmitting again. The initial value of rto is calculated as a function of rtt (round trip time), and is around 200 ms on these EC2 instances.
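To see how that adds up, here is a tiny sketch of the arithmetic, assuming an initial RTO of 200 ms and the 120-second cap mentioned above; it mirrors the doubling logic described here, not the actual kernel code:

```c
/* Rough sketch of how the retransmission backoff accumulates:
 * wait RTO for each attempt, doubling it and capping at 120 s. */
#include <stdio.h>

int main(void)
{
	double rto = 0.2;			/* initial RTO in seconds (~RTT-derived) */
	double rto_max = 120.0;		/* TCP_RTO_MAX cap */
	int tcp_retries2 = 15;		/* net.ipv4.tcp_retries2 */
	double total = 0.0;

	for (int attempt = 0; attempt <= tcp_retries2; attempt++)
	{
		total += rto;
		rto = (rto * 2 > rto_max) ? rto_max : rto * 2;
	}

	printf("give up after ~%.1f seconds\n", total);	/* ~924.6 s here */
	return 0;
}
```

With these inputs it prints roughly 924 seconds, in the same ballpark as the calculation below and the 15-16 minute pgbench timeouts we observe.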
This is consistent with what I was seeing. The value of net.ipv4.tcp_retries2 on the EC2 instances is 15. When I calculate it, 15 retries take around 942 seconds, and if we wait for pgbench, it finishes with a timeout error in 15-16 minutes.
So my conclusion for this issue is that this is a network issue and not a PostgreSQL or Citus problem. We should get faster timeouts by decreasing net.ipv4.tcp_retries2. I haven't tried this yet; I will try it tomorrow or the day after and send an update.
Why do the retransmits fail? I'm not sure, since I don't have much (any?) network experience. But it is an OS or hardware problem (or the EC2 virtualization layer?), and not a database problem. I can do more research later.
The numerous WARNING: could not establish asynchronous connection after 5000 ms warnings are a different problem from this issue; maybe I can spend some time understanding them and see if we can do something about them.
OK, I tried reducing net.ipv4.tcp_retries2 from 15 to 10. As expected, pgbench now times out about 10 minutes earlier than before: 6 minutes instead of 16. If I reduce it to 7, pgbench times out at around 2 minutes.
So, I think we have a good understanding of this issue now. The network connection on c4.4xlarge has a higher error rate than on c3.4xlarge, as seen in the output of sar -n ETCP 20 4. This sometimes results in lost TCP packets that cannot be recovered even after many retransmits. The TCP layer only reports the problem to libpq after it has kept retransmitting for 15-16 minutes (this can be shortened by decreasing the number of retransmits via net.ipv4.tcp_retries2). When libpq finally sees the problem, we get warnings like WARNING: could not consume data from worker node, as expected.
So I think we are almost done with this issue, unless someone has more questions or concerns.
Should we report this on the EC2 forum?
OK, I tried reducing net.ipv4.tcp_retries2 from 15 to 10
The open question that I have is the following: are we going to instruct our users to do this? What is the trade-off if we decrease it too much?
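One related knob worth noting, purely as an illustration and not something we do today: Linux also has a per-socket TCP_USER_TIMEOUT option that bounds how long transmitted data may stay unacknowledged before the connection is aborted, so in principle the timeout could be limited per connection instead of via the system-wide sysctl:

```c
/* Sketch: bound retransmission time per connection with TCP_USER_TIMEOUT
 * (Linux-specific). Shown only as an alternative to lowering
 * net.ipv4.tcp_retries2 system-wide; this is not what Citus does. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int
SetTcpUserTimeout(int sockfd, unsigned int timeoutMs)
{
	/* abort the connection if transmitted data stays unacknowledged
	 * for more than timeoutMs milliseconds */
	return setsockopt(sockfd, IPPROTO_TCP, TCP_USER_TIMEOUT,
					  &timeoutMs, sizeof(timeoutMs));
}
```

The trade-off is the same as with tcp_retries2: set it too low and transient network hiccups start killing otherwise healthy connections.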
On Thu, Jan 11, 2018 at 7:46 AM, Marco Slot notifications@github.com wrote:
Should we report this on the EC2 forum?
Yes, I think we should do that. Currently I am looking at the packets from when this issue happened to see if I can understand it more. I'll post on the EC2 forums after that, maybe tomorrow.
Here is the action plan for this issue at this point:
- Experiment with net.ipv4.tcp_retries2 to understand what value we should recommend to our users. From a brief search, RFC 1122 suggests the retransmission timeout should be at least 100 seconds, which means net.ipv4.tcp_retries2 should be at least 8. On the other hand, IBM DB2's docs suggest a value of 3.
If anyone has more ideas, please post here.
Just for documentation:
- netstat -i on the coordinator and workers doesn't show any packet drops.
- Setting net.ipv4.tcp_tw_reuse and widening net.ipv4.ip_local_port_range made the problem go away, but also made the tps much lower (100 transactions per second). So I think the problem went away because of the lower load, and this is not a proper fix. See linux/Documentation/networking/ip-sysctl.txt for an explanation of these config vars. I tried this because the problem happened after some port reuses; after some research, I think that is unrelated.
Posted this on the EC2 forums: https://forums.aws.amazon.com/thread.jspa?messageID=825819
Encountered this problem, FYI. It had ramifications.
We've removed the real-time executor from the code base. The solution here is to upgrade Citus to 8.3+.
IIRC this came down to TCP getting black-holed on AWS, so it could still occur with the new executor if that issue still exists, but it does not seem like something we could really fix.
Steps to reproduce (it happens almost every time):
fab pg.set_config:max_wal_size,"'5GB'"
fab pg.set_config:max_connections,1000
Load some data with pgbench:
Then pgbench command:
pgbench -f insert_only.sql -P 3 -c128 -j512 -t 1000 -n
Real-time select pgbench file:
pgbench command:
pgbench -f select.sql -P 3 -c32 -j32 -t 512 -n
At some point, some of the clients block with the following backtrace.
Some random notes:
- When we increase the -j and -c values for pgbench, we hit this issue more frequently. We see many more WARNING: could not establish asynchronous connection after 5000 ms warnings and other connection errors, and some of the clients get stuck with the same backtrace.
0 0x00007fa31a33e9d0 in __poll_nocancel () from /lib64/libc.so.6
1 0x00007fa3126cfd77 in MultiClientWait (waitInfo=waitInfo@entry=0x21a2a90) at executor/multi_client_executor.c:822
2 0x00007fa3126d1343 in MultiRealTimeExecute (job=job@entry=0x20cc6f8) at executor/multi_real_time_executor.c:177
3 0x00007fa3126d04aa in RealTimeExecScan (node=0x21200d0) at executor/multi_executor.c:274
4 0x000000000063178d in ExecCustomScan (pstate=0x21200d0) at nodeCustom.c:118
5 0x00000000006239f0 in ExecProcNodeFirst (node=0x21200d0) at execProcnode.c:430
6 0x0000000000638570 in ExecProcNode (node=0x21200d0) at ../../../src/include/executor/executor.h:250
7 ExecLimit (pstate=0x211f918) at nodeLimit.c:95
8 0x00000000006239f0 in ExecProcNodeFirst (node=0x211f918) at execProcnode.c:430
9 0x000000000061ec76 in ExecProcNode (node=0x211f918) at ../../../src/include/executor/executor.h:250
10 ExecutePlan (estate=estate@entry=0x211f700, planstate=0x211f918, use_parallel_mode=0 '\000', operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=1 '\001', numberTuples=numberTuples@entry=0,
11 0x000000000061fb49 in standard_ExecutorRun (queryDesc=0x209ad90, direction=ForwardScanDirection, count=0, execute_once=) at execMain.c:363
12 0x000000000061fbc5 in ExecutorRun (queryDesc=queryDesc@entry=0x209ad90, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=) at execMain.c:306
13 0x00000000007626b8 in PortalRunSelect (portal=portal@entry=0x20c0f58, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x2105ba0) at pquery.c:932
14 0x0000000000763cd4 in PortalRun (portal=portal@entry=0x20c0f58, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=0 '\000', run_once=run_once@entry=1 '\001', dest=dest@entry=0x2105ba0,
15 0x00007fa3126ce9c6 in ExecuteIntoDestReceiver (query=query@entry=0x2104268, params=params@entry=0x0, dest=dest@entry=0x2105ba0) at executor/insert_select_executor.c:181
16 0x00007fa3126ceae8 in ExecuteSelectIntoRelation (targetRelationId=targetRelationId@entry=16740, insertTargetList=insertTargetList@entry=0x2105030, selectQuery=selectQuery@entry=0x2104268,
17 0x00007fa3126cebb8 in CoordinatorInsertSelectExecScan (node=0x2098f98) at executor/insert_select_executor.c:75
18 0x000000000063178d in ExecCustomScan (pstate=0x2098f98) at nodeCustom.c:118
19 0x00000000006239f0 in ExecProcNodeFirst (node=0x2098f98) at execProcnode.c:430
20 0x000000000061ec76 in ExecProcNode (node=0x2098f98) at ../../../src/include/executor/executor.h:250
21 ExecutePlan (estate=estate@entry=0x2098d80, planstate=0x2098f98, use_parallel_mode=0 '\000', operation=operation@entry=CMD_INSERT, sendTuples=sendTuples@entry=0 '\000', numberTuples=numberTuples@entry=0,
22 0x000000000061fb49 in standard_ExecutorRun (queryDesc=0x1fad6a0, direction=ForwardScanDirection, count=0, execute_once=execute_once@entry=1 '\001') at execMain.c:363
23 0x000000000061fbc5 in ExecutorRun (queryDesc=queryDesc@entry=0x1fad6a0, direction=direction@entry=ForwardScanDirection, count=count@entry=0, execute_once=execute_once@entry=1 '\001') at execMain.c:306
24 0x0000000000762d45 in ProcessQuery (plan=plan@entry=0x2101028, sourceText=, params=0x0, queryEnv=0x0, dest=dest@entry=0x2101348, completionTag=completionTag@entry=0x7ffc6ce39e00 "")
25 0x0000000000762f74 in PortalRunMulti (portal=portal@entry=0x20c0e40, isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000', dest=dest@entry=0x2101348,
26 0x0000000000763d70 in PortalRun (portal=portal@entry=0x20c0e40, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x2101348,
27 0x00000000007603e2 in exec_simple_query (query_string=query_string@entry=0x202e1d0 "INSERT INTO test_table_1 SELECT * FROM test_table_1 LIMIT 10;") at postgres.c:1099
28 0x000000000076200a in PostgresMain (argc=, argv=argv@entry=0x1fafe28, dbname=0x1fafd40 "ec2-user", username=) at postgres.c:4090
29 0x00000000006eaad7 in BackendRun (port=port@entry=0x1fa64d0) at postmaster.c:4357
30 0x00000000006ec8c5 in BackendStartup (port=port@entry=0x1fa64d0) at postmaster.c:4029
31 0x00000000006ecb6e in ServerLoop () at postmaster.c:1753
32 0x00000000006ede10 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1f638c0) at postmaster.c:1361
33 0x000000000066372f in main (argc=3, argv=0x1f638c0) at main.c:228
[ec2-user@ip-10-192-0-212 ~]$ pgbench -f insert_select.sql -P 3 -c32 -j32 -t 512
starting vacuum...ERROR: relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR: relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR: relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
progress: 3.0 s, 214.7 tps, lat 132.501 ms stddev 246.319
progress: 6.0 s, 224.3 tps, lat 155.069 ms stddev 221.287
progress: 9.0 s, 222.3 tps, lat 135.148 ms stddev 187.363
progress: 12.0 s, 219.0 tps, lat 151.494 ms stddev 205.526
progress: 15.0 s, 222.0 tps, lat 141.065 ms stddev 169.540
progress: 18.0 s, 222.7 tps, lat 137.910 ms stddev 197.463
progress: 21.0 s, 222.6 tps, lat 141.703 ms stddev 220.887
progress: 24.0 s, 224.4 tps, lat 148.118 ms stddev 215.877
progress: 27.0 s, 224.3 tps, lat 138.928 ms stddev 190.332
WARNING: could not establish asynchronous connection after 5000 ms
progress: 30.0 s, 222.4 tps, lat 149.898 ms stddev 269.491
progress: 33.0 s, 220.9 tps, lat 142.246 ms stddev 185.783
progress: 36.0 s, 223.4 tps, lat 147.924 ms stddev 206.710
progress: 39.0 s, 221.0 tps, lat 144.701 ms stddev 210.550
progress: 42.0 s, 221.7 tps, lat 141.649 ms stddev 186.612
progress: 45.0 s, 225.0 tps, lat 140.926 ms stddev 222.316
progress: 48.0 s, 222.6 tps, lat 146.699 ms stddev 198.950
progress: 51.0 s, 225.0 tps, lat 143.601 ms stddev 222.813
progress: 54.0 s, 224.4 tps, lat 136.603 ms stddev 184.196
progress: 57.0 s, 219.7 tps, lat 145.607 ms stddev 232.662
progress: 60.0 s, 226.4 tps, lat 135.318 ms stddev 209.489
progress: 63.0 s, 224.3 tps, lat 137.365 ms stddev 225.198
progress: 66.0 s, 224.3 tps, lat 132.615 ms stddev 156.285
progress: 69.0 s, 222.7 tps, lat 115.979 ms stddev 142.710
progress: 72.0 s, 223.3 tps, lat 71.027 ms stddev 22.280