Tencent / TBase

TBase is an enterprise-level distributed HTAP database. A single database cluster provides users with highly consistent distributed database services and high-performance data warehouse services, forming an integrated enterprise-level solution.

Latest implementation of analyze tables is not performant / crashes. #62

Open yazun opened 4 years ago

yazun commented 4 years ago

We see that the implementation of creating stats changed in the recent release. Unfortunately, it seems to never finish (it has already been running for 40 minutes; to be seen if it finishes at all) on partitions that previously took on the order of 10 minutes to analyze.

Seems like a serious regression, unfortunately.

     1 | dr3_ops_cs36 | datanode1  | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.467587+02
     1 | dr3_ops_cs36 | datanode10 | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.468017+02
     1 | dr3_ops_cs36 | datanode11 | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.470717+02
     1 | dr3_ops_cs36 | datanode12 | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.46948+02
     1 | dr3_ops_cs36 | datanode2  | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.469417+02
     1 | dr3_ops_cs36 | datanode3  | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.469465+02
     1 | dr3_ops_cs36 | datanode4  | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.470471+02
     1 | dr3_ops_cs36 | datanode5  | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.470168+02
     1 | dr3_ops_cs36 | datanode6  | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.470467+02
     1 | dr3_ops_cs36 | datanode7  | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.471108+02
     1 | dr3_ops_cs36 | datanode8  | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.471732+02
     1 | dr3_ops_cs36 | datanode9  | SAMPLE dr3_ops_cs36_part.source_20(3000000)                                                                                                                                 | 2020-07-23 16:18:46.472935+02
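
For reference, a minimal sketch of how such in-flight SAMPLE statements could be spotted on a single datanode, assuming the build supports EXECUTE DIRECT from a coordinator; the node name datanode1 and the LIKE pattern are illustrative assumptions, not confirmed TBase-specific API:

```sql
-- Minimal sketch (assumptions, not confirmed TBase behaviour): look for
-- long-running SAMPLE statements issued by ANALYZE on one datanode, using
-- EXECUTE DIRECT from a coordinator session.
EXECUTE DIRECT ON (datanode1)
  'SELECT pid, now() - query_start AS runtime, query
     FROM pg_stat_activity
    WHERE query LIKE ''SAMPLE %''
    ORDER BY runtime DESC';
```
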
yazun commented 4 years ago

To quantify the problem: same data, but the TBase partition is smaller since it is a 12-node cluster holding the same data as XL, with around 20M records in the partition:

analyze verbose source_20;
INFO:  analyzing "dr3_ops_cs36_part.source_20"
ANALYZE
Time: 2676322.855 ms (44:36.323)

XL on 8 nodes cluster (CPU bound):

analyze verbose source_20;
INFO:  analyzing "dr3_ops_cs36_part.source_20"
ANALYZE
Time: 451978.182 ms (07:31.978)

so the XL analyze is roughly 6x faster (451978 ms vs 2676323 ms) on about 30% more data per datanode (of course, this does not include the analyze on the coordinator that distributes stats to all coordinators).
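
As a side note, one possible mitigation to try is lowering the statistics target before analyzing very large partitions, so that a smaller sample is taken per column. This is only a hedged sketch; whether TBase's new sampling-based ANALYZE honours the target the same way as stock PostgreSQL is an assumption:

```sql
-- Hedged workaround sketch: shrink the per-column sample before analyzing a
-- very large partition (trade-off: coarser histograms, less accurate stats).
SET default_statistics_target = 10;   -- stock PostgreSQL default is 100
ANALYZE VERBOSE dr3_ops_cs36_part.source_20;
RESET default_statistics_target;
```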

JennyJennyChen commented 4 years ago

Thank you for your issue. Our internal developers are confirming it, and we will provide feedback as soon as there is any progress.

Annashuo commented 4 years ago

Hi yazun, thank you for your issue! In the recent release, creating stats on the coordinator includes sampling data from all datanodes and calculating stats on the coordinator, which is more accurate but slower than the previous approach or PGXL. PGXL creates stats on the coordinator by simply collecting stats from all datanodes and summing them up, or by randomly choosing the stats from one datanode. This can cause inaccurate stats and slower query execution. To fix this problem, in the recent release we force the coordinator to sample data rows from the datanodes and calculate stats based on those rows. This takes more time, since the data rows have to be transferred from the datanodes to the coordinator. We will try to improve the performance of the sampling part in the future.
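
To see the difference in practice, the coordinator's column statistics can be compared with one datanode's local statistics for the same partition; a rough sketch, assuming EXECUTE DIRECT is available and using datanode1 purely as an example:

```sql
-- Rough sketch: compare coordinator-level and datanode-level statistics for
-- the same partition. Column names come from the standard pg_stats view;
-- the node name datanode1 is an illustrative assumption.
SELECT attname, n_distinct, null_frac, avg_width
  FROM pg_stats
 WHERE schemaname = 'dr3_ops_cs36_part' AND tablename = 'source_20';

EXECUTE DIRECT ON (datanode1)
  'SELECT attname, n_distinct, null_frac, avg_width
     FROM pg_stats
    WHERE schemaname = ''dr3_ops_cs36_part'' AND tablename = ''source_20''';
```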

yazun commented 3 years ago

I would imagine that merging the histograms could be possible, so that samples would not have to go to the coordinator; or at least the whole process could be parallelized, so that only the final merge of samples is sequential?

Sadly, if we run analyze on partitioned tables, it fails after several hours (versus minutes in XL):

INFO:  analyzing "dr3_ops_cs36.source" inheritance tree
INFO:  analyzing "dr3_ops_cs36_part.source_1"
INFO:  analyzing "dr3_ops_cs36_part.source_2"
INFO:  analyzing "dr3_ops_cs36_part.source_3"
INFO:  analyzing "dr3_ops_cs36_part.source_4"
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 15656533.429 ms (04:20:56.533)

Interestingly, there's no core dump.

yazun commented 3 years ago

It's 100% repeatable:

(dr3_ops_cs36@[local:/tmp/pg_stat_tmp/xzDR3/]:55431) [surveys] > analyze verbose ts;
INFO:  analyzing "dr3_ops_cs36.ts" inheritance tree
INFO:  analyzing "dr3_ops_cs36_part.ts_1"
INFO:  analyzing "dr3_ops_cs36_part.ts_2"
INFO:  analyzing "dr3_ops_cs36_part.ts_3"
INFO:  analyzing "dr3_ops_cs36_part.ts_4"
INFO:  analyzing "dr3_ops_cs36_part.ts_5"
INFO:  analyzing "dr3_ops_cs36_part.ts_6"
INFO:  analyzing "dr3_ops_cs36_part.ts_7"
INFO:  analyzing "dr3_ops_cs36_part.ts_8"
INFO:  analyzing "dr3_ops_cs36_part.ts_9"
INFO:  analyzing "dr3_ops_cs36_part.ts_10"
INFO:  analyzing "dr3_ops_cs36_part.ts_11"
INFO:  analyzing "dr3_ops_cs36_part.ts_12"
INFO:  analyzing "dr3_ops_cs36_part.ts_13"
INFO:  analyzing "dr3_ops_cs36_part.ts_14"
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 28127614.205 ms (07:48:47.614)

These partitions hold around 700M large records each, distributed over 12 nodes.
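
One way to limit the blast radius of such a crash, as a workaround sketch rather than a fix, is to analyze the leaf partitions one by one instead of the whole inheritance tree, so that stats already collected are kept when a later partition fails:

```sql
-- Workaround sketch: analyze leaf partitions individually rather than the
-- parent table, so a crash on one partition does not discard earlier work.
-- Partition names follow the pattern visible in the log above.
ANALYZE VERBOSE dr3_ops_cs36_part.ts_1;
ANALYZE VERBOSE dr3_ops_cs36_part.ts_2;
-- ... repeat for the remaining partitions.
```

Note that the parent table's inheritance-tree statistics would still have to be gathered separately.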

yazun commented 3 years ago

It's a blocker for us; without proper stats, many query plans are bogus. Would it be possible to bring back the old behaviour via a GUC until this is fixed?

yazun commented 3 years ago

At least the already-analyzed tables have their stats updated.

yazun commented 3 years ago

Seems the coordinating process is killed. Digging in the logs reveals:

```
<166>Aug 8 14:33:26 gaiadb01 coord[1692]: [26] 2020-08-08 14:33:26 CEST [1692,coord(1692,0)]:xid[0-] [55-1] user=,db=,client=,query=,LOG: 00000: server process (PID 38483) was terminated by signal 9: Killed
2020-08-08 14:33:26 CEST [1692,coord(1692,0)]:xid[0-] [56-1] user=,db=,client=,query=,DETAIL: Failed process was running:
     analyze verbose source_12 (sourceid);
2020-08-08 14:33:26 CEST [1692,coord(1692,0)]:xid[0-] [57-1] user=,db=,client=,query=,LOCATION: LogChildExit, postmaster.c:4143
```

And indeed, the OOM killer kills the coordinator process doing the analyze:

```
[Sat Aug 8 14:33:03 2020] Out of memory: Kill process 38483 (postgres) score 72 or sacrifice child
[Sat Aug 8 14:33:03 2020] Killed process 38483 (postgres), UID 10256, total-vm:34334916kB, anon-rss:19339084kB, file-rss:224kB, shmem-rss:2856kB
```

There's at least 90GB free when analyze starts. Seems like a huge leak or a logic problem in the new analyze code.

yazun commented 3 years ago

With the latest changes in the statistics code, analyze (coordinator) is also prohibitively slow, and there is a need to gather stats on all the coordinators. What happened to make analyze (coordinator) so slow? It should pull stats only from the datanodes, or possibly from the most up-to-date coordinator. Related to #32

Eric5553 commented 3 years ago

@yazun, thanks so much for your professional investigation of the stats issue. As @Annashuo mentioned, in the recent release we use a sampling mechanism instead of simply summing up or randomly picking datanode statistics, in order to get more accurate statistics on coordinators. Some of our customers have confirmed that the new strategy produces better plan optimization results.

In the meantime, your observations are also useful for improving the new stats-gathering strategy. They exposed a potential performance penalty that we have to take into consideration. Here are my thoughts:

  1. Provide a GUC to switch back to the prior strategy.
  2. Improve the sampling strategy for partitioned tables and large-volume tables. Maybe we could add flexible sampling factors based on data volume and other basic stats to reduce the network and CPU costs, plus other potential code improvements.
  3. For sure, the ultimate way to solve the statistics problem might be calculating distributed histograms and other extended statistics with advanced mathematical algorithms. But it would take a long time to invent/polish/prove those algorithms, and there could also be upcoming statistics features that cannot fit into such algorithms easily.

Thus, as a trade-off, we will try to solve the problem with methods 1 & 2. We will get back to you once we have any progress :-)

Thanks, Eric Wu

Eric5553 commented 3 years ago

@yazun, we just added a new GUC enable_sampling_analyze to specify the analyze method. The default setting is TRUE, which uses the sampling method. You can turn it off to force use of the previous sum-up method. Thanks.
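
A minimal usage sketch of the new GUC, assuming it can be set at session level (whether it is also meant to be set in postgresql.conf is an assumption):

```sql
-- Minimal usage sketch for the GUC announced above: disable sampling-based
-- ANALYZE for this session and fall back to the previous sum-up method.
SET enable_sampling_analyze = off;
ANALYZE VERBOSE dr3_ops_cs36_part.source_20;
RESET enable_sampling_analyze;
```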

yazun commented 3 years ago

Thank you very much for the prompt fix @Eric5553! This will unblock us so we can continue the system testing. Krzysztof