yazun opened this issue 4 years ago
To quantify the problem: with the same data, the TBase partitions are smaller because the cluster has 12 nodes (vs. 8 for XL), around 20M records per partition:

```
analyze verbose source_20;
INFO: analyzing "dr3_ops_cs36_part.source_20"
ANALYZE
Time: 2676322.855 ms (44:36.323)
```
XL on an 8-node cluster (CPU bound):

```
analyze verbose source_20;
INFO: analyzing "dr3_ops_cs36_part.source_20"
ANALYZE
Time: 451978.182 ms (07:31.978)
```
So XL's ANALYZE is 6x faster despite having 30% more data per datanode (and of course this does not include the `analyze (coordinator)` step to distribute the stats to all coordinators).
Thank you for the report. Our internal developers are looking into this issue, and we will provide feedback as soon as there is any progress.
Hi yazun, thank you for your issue! In the recent release, creating stats on the coordinator involves sampling data from all datanodes and calculating the stats on the coordinator itself, which is more accurate but slower than the previous behaviour (and than PGXL). PGXL builds coordinator stats by simply collecting the stats from all datanodes and either summing them up or randomly picking the stats of one datanode; this can produce inaccurate stats and slower query execution. To fix that, the recent release forces the coordinator to sample data rows from the datanodes and calculate the stats from those rows. This takes more time, since the rows must be transferred from the datanodes to the coordinator. We will try to improve the performance of the sampling part in the future.
I would imagine that merging the histograms could be possible, so the raw samples would not have to go to the coordinator; or at least the whole process could be parallelized, so that only the final merge of the samples is sequential?
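To illustrate what "merging the histograms" could mean, here is a hypothetical sketch (not TBase's actual implementation) of combining per-datanode equi-depth histograms on the coordinator, so that only compact summaries travel over the network instead of sampled rows. The function name and data layout are invented for illustration.

```python
def merge_equi_depth(histograms, n_buckets):
    """Approximately merge per-node equi-depth histograms.

    histograms: list of (bounds, row_count) pairs, where `bounds` are the
    sorted bucket boundaries of one datanode's equi-depth histogram.
    Returns merged boundaries defining n_buckets buckets.
    """
    weighted = []       # (representative value, row weight) pairs
    total_rows = 0.0
    for bounds, rows in histograms:
        per_bucket = rows / (len(bounds) - 1)   # equi-depth: equal mass per bucket
        for lo, hi in zip(bounds, bounds[1:]):
            weighted.append(((lo + hi) / 2.0, per_bucket))  # bucket midpoint
        total_rows += rows
    weighted.sort()

    step = total_rows / n_buckets
    merged = [min(h[0][0] for h in histograms)]  # global minimum
    acc, target = 0.0, step
    for value, weight in weighted:
        acc += weight
        # emit a boundary each time another total_rows/n_buckets rows accumulate
        while acc >= target and len(merged) < n_buckets:
            merged.append(value)
            target += step
    merged.append(max(h[0][-1] for h in histograms))  # global maximum
    return merged
```

Each datanode would ship only its boundaries and row count, and the merges are associative, so they could also run in a parallel tree with only the final merge being sequential, as suggested above.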
Sadly, if ANALYZE is run on partitioned tables, it fails after several hours (versus minutes in XL):

```
INFO: analyzing "dr3_ops_cs36.source" inheritance tree
INFO: analyzing "dr3_ops_cs36_part.source_1"
INFO: analyzing "dr3_ops_cs36_part.source_2"
INFO: analyzing "dr3_ops_cs36_part.source_3"
INFO: analyzing "dr3_ops_cs36_part.source_4"
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 15656533.429 ms (04:20:56.533)
```
Interestingly, there's no core dump.
It's 100% repeatable:
```
(dr3_ops_cs36@[local:/tmp/pg_stat_tmp/xzDR3/]:55431) [surveys] > analyze verbose ts;
INFO: analyzing "dr3_ops_cs36.ts" inheritance tree
INFO: analyzing "dr3_ops_cs36_part.ts_1"
INFO: analyzing "dr3_ops_cs36_part.ts_2"
INFO: analyzing "dr3_ops_cs36_part.ts_3"
INFO: analyzing "dr3_ops_cs36_part.ts_4"
INFO: analyzing "dr3_ops_cs36_part.ts_5"
INFO: analyzing "dr3_ops_cs36_part.ts_6"
INFO: analyzing "dr3_ops_cs36_part.ts_7"
INFO: analyzing "dr3_ops_cs36_part.ts_8"
INFO: analyzing "dr3_ops_cs36_part.ts_9"
INFO: analyzing "dr3_ops_cs36_part.ts_10"
INFO: analyzing "dr3_ops_cs36_part.ts_11"
INFO: analyzing "dr3_ops_cs36_part.ts_12"
INFO: analyzing "dr3_ops_cs36_part.ts_13"
INFO: analyzing "dr3_ops_cs36_part.ts_14"
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 28127614.205 ms (07:48:47.614)
```
Each of these partitions holds around 700M large records, distributed over 12 nodes.
It's a blocker for us: without proper stats, many query plans are bogus. Would it be possible to bring back the old behaviour, via a GUC, until this one is fixed?
At least the tables that were already analyzed have their stats updated.
Seems the coordinating process is killed. Digging in the logs reveals:

```
<166>Aug 8 14:33:26 gaiadb01 coord[1692]: [26] 2020-08-08 14:33:26 CEST [1692,coord(1692,0)]:xid[0-] [55-1] user=,db=,client=,query=,LOG: 00000: server process (PID 38483) was terminated by signal 9: Killed
2020-08-08 14:33:26 CEST [1692,coord(1692,0)]:xid[0-] [56-1] user=,db=,client=,query=,DETAIL: Failed process was running:
	 analyze verbose source_12 (sourceid);
2020-08-08 14:33:26 CEST [1692,coord(1692,0)]:xid[0-] [57-1] user=,db=,client=,query=,LOCATION: LogChildExit, postmaster.c:4143
```

And indeed, the OOM killer kills the coordinator process doing the ANALYZE:

```
[Sat Aug 8 14:33:03 2020] Out of memory: Kill process 38483 (postgres) score 72 or sacrifice child
[Sat Aug 8 14:33:03 2020] Killed process 38483 (postgres), UID 10256, total-vm:34334916kB, anon-rss:19339084kB, file-rss:224kB, shmem-rss:2856kB
```

There is at least 90 GB free when the ANALYZE starts, so this looks like a huge leak or a logic problem in the new analyze code.

With the latest changes in the statistics code, `analyze (coordinator)` is also prohibitively slow, and there is a need to gather stats on all the coordinators. What happened that `analyze (coordinator)` got so slow? It should pull stats only from the datanodes, or possibly from the most up-to-date coordinator.
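The OOM pattern above suggests the coordinator is accumulating far more row data than the sample size requires. One standard way to cap coordinator memory while still sampling uniformly from a stream of rows arriving from datanodes is reservoir sampling (Vitter's Algorithm R). This is a generic sketch of the technique, not TBase's actual code:

```python
import random

def reservoir_sample(row_stream, k, rng=None):
    """Keep a uniform random sample of k rows from a stream of unknown
    length, using O(k) memory no matter how many rows arrive."""
    rng = rng or random.Random()
    sample = []
    for i, row in enumerate(row_stream):
        if i < k:
            # fill the reservoir with the first k rows
            sample.append(row)
        else:
            # replace a reservoir slot with probability k / (i + 1)
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = row
    return sample
```

With this scheme the coordinator's memory is bounded by the statistics target, not by the table size, which would avoid the `anon-rss:19339084kB` blow-up seen in the OOM log regardless of how many rows the datanodes send.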
Related to #32
@yazun, thanks so much for your professional investigation of the stats issue. As @Annashuo mentioned, in the recent release we use a sampling mechanism instead of the simple sum-up / random-pick of datanode statistics, in order to get more accurate statistics on the coordinators. Some of our customers have confirmed that the new strategy yields better plan optimization results.
Meanwhile, your observations are also useful for improving the new stats-gathering strategy: they expose a potential performance penalty that we have to take into consideration. Here are my thoughts:
Thus, as a trade-off, we will try to solve the problem with methods 1 and 2. We will get back to you once we have any progress :-)
Thanks, Eric Wu
@yazun, we just added a new GUC `enable_sampling_analyze` to specify the analyze method. The default setting is TRUE, which uses the sampling method. You can turn it off to force the previous sum-up method. Thanks.
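Assuming `enable_sampling_analyze` behaves like an ordinary boolean GUC, falling back to the old behaviour for one session would look something like this (table name taken from the earlier examples):

```sql
-- fall back to the previous sum-up method for this session only
SET enable_sampling_analyze = off;
ANALYZE VERBOSE source_20;
-- restore the default (sampling) method
RESET enable_sampling_analyze;
```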
Thank you very much for the prompt fix, @Eric5553! This will unblock us to continue the system testing. Krzysztof
We see that the implementation of creating stats changed in the recent release. Unfortunately, it seems to never finish (it has already been running for 40 minutes; it remains to be seen whether it finishes at all) on partitions that previously took on the order of 10 minutes to analyze.
Unfortunately, this looks like a serious regression.