ApsaraDB / PolarDB-for-PostgreSQL

A cloud-native database based on PostgreSQL developed by Alibaba Cloud.
https://apsaradb.github.io/PolarDB-for-PostgreSQL/zh/
Apache License 2.0

[ERROR] Cannot use pgbench to test the performance of the cluster #109

Closed Essoz closed 3 years ago

Essoz commented 3 years ago

I am deploying the cluster on an Intel 8272CL 8-core, 64 GiB Azure VM (Standard E8s_v4) using the default configuration. The cluster has been configured according to the Performance Whitepaper (性能白皮书) of PolarDB for PG.

pgbench (the benchmarking tool built into PostgreSQL) frequently fails with "PQputline failed". When I use /usr/pgsql-11/bin/pgbench -i -s 1000 to generate test data, the generation process always fails at some point.

For example, after generating only 3 percent of the data:

...
3600000 of 100000000 tuples (3%) done (elapsed 31.17 s, remaining 834.65 s)
3700000 of 100000000 tuples (3%) done (elapsed 31.41 s, remaining 817.60 s)
PQputline failed
connection to database "postgres" failed:
FATAL:  the database system is in recovery mode
connection to database "postgres" failed:
FATAL:  the database system is in recovery mode
connection to database "postgres" failed:
FATAL:  the database system is in recovery mode
connection to database "postgres" failed:
FATAL:  the database system is in recovery mode

Sometimes the problem occurs after more than 40% of the data has been generated. I noticed a significant difference between successful runs and failed ones: runs that eventually fail show remaining-time estimates that vary widely, like this:

100000 of 100000000 tuples (0%) done (elapsed 0.04 s, remaining 40.28 s)
200000 of 100000000 tuples (0%) done (elapsed 0.08 s, remaining 40.40 s)
300000 of 100000000 tuples (0%) done (elapsed 0.12 s, remaining 40.14 s)
400000 of 100000000 tuples (0%) done (elapsed 4.10 s, remaining 1021.15 s)
500000 of 100000000 tuples (0%) done (elapsed 4.31 s, remaining 857.37 s)
600000 of 100000000 tuples (0%) done (elapsed 5.55 s, remaining 919.30 s)

In failing runs the remaining-time estimate rockets up and falls back during generation, whereas in successful runs it decreases roughly linearly (the whole data-generation process takes about 700 s).
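For context on why -s 1000 is a heavy load: pgbench's scale factor maps linearly to row counts, with 100,000 rows added to pgbench_accounts per scale unit, which matches the 100,000,000 tuples shown in the log above. The per-unit size used below is only a rough rule of thumb, not an exact figure:

```shell
# Rough sizing for pgbench's -s (scale factor).
# Each scale unit adds 100,000 rows to pgbench_accounts,
# so -s 1000 produces the 100,000,000 tuples seen in the log.
# The ~16 MB per scale unit is an approximation, not an exact number.
scale=1000
rows=$((scale * 100000))
approx_mb=$((scale * 16))
echo "scale=$scale rows=$rows approx_size=${approx_mb}MB"
```

Running this prints scale=1000 rows=100000000 approx_size=16000MB, i.e. on the order of 16 GB of table data before indexes, which helps explain why generation takes several hundred seconds on this VM.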

Essoz commented 3 years ago

Sorry, I made a silly mistake. Please follow the testing instructions in the repo instead of those in the Performance Whitepaper (性能白皮书).

Anton-Shutik commented 2 years ago

@Essoz I'm getting exactly the same error. It goes away if I set -s 100 or less, but it constantly fails at 1000. How did you fix it?

Essoz commented 2 years ago

> @Essoz I'm getting exactly the same error. It goes away if I set -s 100 or less, but it constantly fails at 1000. How did you fix it?

How large a scale factor you can use is somewhat related to the server you run on: you can probably set a larger scale factor on a high-performance machine with large disks.

But the version here is different from the commercial version, so the benchmarking instructions in the whitepaper do not apply to it. Instead, follow the instructions in benchmark.md.

To put it another way, don't use a scale factor larger than 32 🦥
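A minimal sketch of that advice, assuming the same /usr/pgsql-11 install path and default postgres database as in the original report; the paths, connection options, and client counts are assumptions to adjust for your own setup:

```shell
#!/bin/sh
# Sketch: initialize at the suggested scale factor, then run a short
# benchmark. Guarded so the script degrades gracefully where pgbench
# is not installed at this (assumed) path.
PGBENCH=/usr/pgsql-11/bin/pgbench
if command -v "$PGBENCH" >/dev/null 2>&1; then
  "$PGBENCH" -i -s 32 postgres          # initialize with scale factor 32
  "$PGBENCH" -c 8 -j 8 -T 60 postgres   # 8 clients, 8 threads, 60-second run
else
  echo "pgbench not found at $PGBENCH"
fi
```

The -c/-j values here simply match the 8-core VM from the report; there is nothing special about them otherwise.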