citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0

Very poor TPCC performance if using citus extension on PostgreSQL #4126

Open GangLiCN opened 4 years ago

GangLiCN commented 4 years ago

[Summary] I'm hitting a very tough issue: TPCC performance is very poor (about 1/10 of a single PostgreSQL instance) when using the Citus extension on PostgreSQL.

[Environment] PostgreSQL: 11.x, BenchmarkSQL: 5.0, Citus: 9.4, Coordinator nodes: 1, Worker nodes: 2

[Description]

  1. Tested TPCC against a single PostgreSQL instance; tpmC was about 20000-30000.
  2. Installed and configured the Citus extension, then modified "tableCreates.sql" for table creation:

     ......
     SET citus.replication_model = 'streaming';
     SELECT create_distributed_table('bmsql_config', 'cfg_name');
     SELECT create_distributed_table('bmsql_new_order', 'no_w_id');
     SELECT create_distributed_table('bmsql_oorder', 'o_w_id');
     SELECT create_distributed_table('bmsql_order_line', 'ol_w_id');
     SELECT create_distributed_table('bmsql_history', 'h_c_w_id');
     SELECT create_distributed_table('bmsql_customer', 'c_w_id');

SELECT create_reference_table('bmsql_item');
SELECT create_reference_table('bmsql_district');
SELECT create_reference_table('bmsql_warehouse');
SELECT create_reference_table('bmsql_stock');

The reason I created reference tables for [warehouse], [stock], [item], and [district] is the foreign key limitation, which prevented me from creating distributed tables for all of them.
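For comparison, here is a colocation-based alternative layout (a sketch only, not what was tested above; it assumes the default BenchmarkSQL 5.0 column names such as w_id, d_w_id, s_w_id, and c_w_id): every warehouse-keyed table is distributed by its warehouse id and colocated, and only the read-only bmsql_item table stays a reference table. Citus allows foreign keys between colocated distributed tables when the key includes the distribution column, and from a distributed table to a reference table, so most of the foreign key limitation goes away.

    -- Sketch of an alternative layout, assuming default BenchmarkSQL 5.0 column names.
    -- Distribute the warehouse-keyed tables on the warehouse id and colocate them.
    SELECT create_distributed_table('bmsql_warehouse', 'w_id');
    SELECT create_distributed_table('bmsql_district', 'd_w_id', colocate_with => 'bmsql_warehouse');
    SELECT create_distributed_table('bmsql_stock', 's_w_id', colocate_with => 'bmsql_warehouse');
    SELECT create_distributed_table('bmsql_customer', 'c_w_id', colocate_with => 'bmsql_warehouse');
    -- Keep only the item catalog as a reference table (replicated to every node).
    SELECT create_reference_table('bmsql_item');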

  3. Executed "runDatabaseBuild.sh" to load the warehouse data into the database, then created the indexes and foreign keys, and finally vacuumed/analyzed all tables to refresh their statistics.

  4. Executed "runBenchmark.sh" to run the TPCC test; unfortunately the result is very poor, with a tpmC of about 2000.

[Questions]

  1. Can someone help find out what is wrong with my test steps? Why is TPCC performance on the Citus cluster so poor?
  2. While the TPCC test was running, I noticed that the overall workload on the coordinator node is very high, but the workload on the worker nodes is very small (for both disk and network). So I wonder whether this is a Citus bug or limitation (cross-node transactions not being handled/distributed correctly)? Are there any parameters on the Citus side that can be adjusted or tuned?

Thanks in advance for your help!

Attached are two screenshots of the different TPCC test results. Test environment: 100 warehouses, 20 clients.

pg11_bmsql50_100wh_20clients_10min_latest 100wh_10clients_concurrent_run_node1

GangLiCN commented 4 years ago

Sorry, I forgot one "important" thing.

I also tried to "convert" all worker nodes into "master" nodes (in other words, after the conversion each original worker node acts as both a coordinator and a worker). The steps are as follows:

1) Run the following SQL statements on the worker nodes:

......
truncate table pg_dist_node;
truncate table pg_dist_partition;
truncate table pg_dist_shard;
--truncate table pg_dist_shard_placement;
truncate table pg_dist_colocation;

copy pg_dist_node from PROGRAM 'psql "host=master1 port=5432 dbname=benchmarksql user=benchmarksql" -Atc "copy pg_dist_node to STDOUT"';

copy pg_dist_partition from PROGRAM 'psql "host=master1 port=5432 dbname=benchmarksql user=benchmarksql" -Atc "copy pg_dist_partition to STDOUT"';

copy pg_dist_shard from PROGRAM 'psql "host=master1 port=5432 dbname=benchmarksql user=benchmarksql" -Atc "copy pg_dist_shard to STDOUT"';

-- Must use "copy (select * from ...)" to copy data from a view:
copy pg_dist_shard_placement from PROGRAM 'psql "host=master1 port=5432 dbname=benchmarksql user=benchmarksql" -Atc "copy (select * from pg_dist_shard_placement) to STDOUT"';

copy pg_dist_colocation from PROGRAM 'psql "host=master1 port=5432 dbname=benchmarksql user=benchmarksql" -Atc "copy pg_dist_colocation to STDOUT"';

Summary: after executing the SQL statements above, the coordinator node's metadata is transferred to the worker nodes, which "converts" each original worker node into a coordinator as well.

2) Executed "runBenchmark.sh" on the worker nodes. A Java runtime exception is triggered (a deadlock error reported by the database), and the overall TPCC performance is a bit better than the previous results (total tpmC is about 4000 across the 3 nodes). deadlock_issue
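As a side note on the metadata-copying step above: Citus also ships a UDF that syncs the coordinator's metadata to a worker so the worker can plan and route queries itself, which avoids copying the catalog tables by hand. A sketch, under the assumption that it applies to this Citus 9.4 setup (the host name is a placeholder):

    -- Run on the coordinator; 'worker1' is a placeholder for an actual worker host.
    -- Requires citus.replication_model = 'streaming', which is already set above.
    SELECT start_metadata_sync_to_node('worker1', 5432);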

marcocitus commented 4 years ago

Simple OLTP benchmarks often end up being bottlenecked on #connections / response time. In Citus, response time is naturally much higher than in PostgreSQL because every statement involves an extra network round-trip, hence throughput is lower unless you make sure that there is a sufficient number of connections to keep the coordinator busy.
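As an illustration (not part of the original reply; the settings below are standard PostgreSQL/Citus GUCs but the values are arbitrary), that usually means running the benchmark with many more terminals/connections and letting the coordinator keep its worker connections cached between statements:

    -- Illustrative values only; tune for your own hardware.
    ALTER SYSTEM SET max_connections = 600;                  -- allow more concurrent benchmark sessions (needs a restart)
    ALTER SYSTEM SET citus.max_cached_conns_per_worker = 2;  -- keep connections to each worker cached per backend
    SELECT pg_reload_conf();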

The other challenge with this type of OLTP benchmark is that almost all the work is in planning the query, not in executing it. In the Citus case that means the coordinator has to do about as much work as a single PostgreSQL server would normally do. In practice, applications that require scale (read: Citus) often have much higher execution times (e.g. because data does not fit in memory) and this equation works out very differently.

A somewhat more sophisticated TPC-C implementation is HammerDB, which is published by the TPC-council: https://github.com/TPC-Council/HammerDB

HammerDB wraps transactions in stored procedures, and Citus can delegate the call to the procedure to a worker node with minimal planning overhead on the coordinator and in a single network round trip. Hence, it gives a better reflection of the benefits of scaling out.
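For context, that delegation is set up through Citus's generic create_distributed_function() API; the sketch below uses a placeholder procedure name and signature rather than the actual HammerDB procedures:

    -- Placeholder procedure: delegate calls to the worker that owns the shard
    -- for the warehouse id argument, colocated with a table distributed on no_w_id.
    SELECT create_distributed_function(
        'new_order_proc(int, int, int)',
        distribution_arg_name := 'p_w_id',
        colocate_with := 'bmsql_new_order'
    );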

We have some tooling to make it easier to run HammerDB against Citus: https://github.com/citusdata/ch-benchmark

onderkalaci commented 3 years ago

I think this blog post explains it in even more detail: https://www.citusdata.com/blog/2020/11/21/making-postgres-stored-procedures-9x-faster-in-citus/