
Materialized view with refresh #61712

Open VladislavePolyakov opened 3 months ago

VladislavePolyakov commented 3 months ago

When creating the MV I got an error, the database went down, and it can't start anymore.

The expectation is that the MV is created and the database does not go down.

A link to reproducer in https://fiddle.clickhouse.com/.

create table devices_pg on cluster cluster_name
(
 id                             UUID,
 device_models_dict_code        String,
 comment                        String,
 source                         String,
 created_by                     UUID,
 updated_by                     UUID,
 created_at                     DateTime,
 updated_at                     DateTime,
 deleted_at                     Nullable(DateTime),
 status                         String,
 platform                       String,
 compliance                     String,
 tenant_code                    String
) ENGINE = PostgreSQL(getMacro('host_port'), getMacro('dbname'), 'devices', getMacro('dbuser'), getMacro('dbpassword'), 'schema_name');
SELECT *
FROM devices_pg
1200 rows in set. Elapsed: 0.020 sec. Processed 1.20 thousand rows, 345.63 KB (61.29 thousand rows/s., 17.65 MB/s.)
Peak memory usage: 551.56 KiB.
CREATE MATERIALIZED VIEW devices_mv ON CLUSTER ocs
REFRESH EVERY 20 MINUTE
ENGINE = MergeTree
ORDER BY tuple() AS
SELECT *
FROM devices_pg

Query id: afffabc5-5c6f-4337-a8b8-6414af2e5398

↘ Progress: 0.00 rows, 0.00 B (0.00 rows/s., 0.00 B/s.)                                                                               
↖ Progress: 0.00 rows, 0.00 B (0.00 rows/s., 0.00 B/s.)                                                                               
Elapsed: 9.077 sec. 

Received exception from server (version 24.1.6):
Code: 999. DB::Exception: Received from localhost:9000. Coordination::Exception. Coordination::Exception: Coordination error: Connection loss: While executing DDLQueryStatus. (KEEPER_EXCEPTION)

SELECT 1

Query id: d7cae4f6-1dc2-4def-a5fa-c1a94d5a85ac

Connecting to localhost:9000 as user default.
Exception on client:
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)

How to reproduce

Stack trace:

2024.03.21 16:41:04.397172 [ 24530 ] {afffabc5-5c6f-4337-a8b8-6414af2e5398} <Error> executeQuery: Code: 999. Coordination::Exception: Coordination error: Connection loss: While executing DDLQueryStatus. (KEEPER_EXCEPTION) (version 24.1.6.52 (official build)) (from 127.0.0.1:43230) (in query: CREATE MATERIALIZED VIEW devices_mv ON CLUSTER ocs REFRESH EVERY 20 MINUTE ENGINE = MergeTree ORDER BY tuple() AS SELECT * FROM devices_pg), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c80359b in /usr/bin/clickhouse
1. DB::Exception::Exception<char const*>(int, FormatStringHelperImpl<std::type_identity<char const*>::type>, char const*&&) @ 0x000000000af890d0 in /usr/bin/clickhouse
2. Coordination::Exception::Exception(Coordination::Error) @ 0x000000001300da82 in /usr/bin/clickhouse
3. zkutil::ZooKeeper::tryMulti(std::vector<std::shared_ptr<Coordination::Request>, std::allocator<std::shared_ptr<Coordination::Request>>> const&, std::vector<std::shared_ptr<Coordination::Response>, std::allocator<std::shared_ptr<Coordination::Response>>>&, bool) @ 0x000000001301abc2 in /usr/bin/clickhouse
4. zkutil::MultiReadResponses<Coordination::ListResponse, true> zkutil::ZooKeeper::multiRead<Coordination::ListResponse, true, std::__wrap_iter<String const*>>(std::__wrap_iter<String const*>, std::__wrap_iter<String const*>, std::function<std::shared_ptr<Coordination::Request> (String const&)>, std::function<std::future<Coordination::ListResponse> (String const&)>) @ 0x000000000fd8902c in /usr/bin/clickhouse
5. void DB::ZooKeeperRetriesControl::retryLoop<DB::DDLQueryStatusSource::generate()::$_0&, void DB::ZooKeeperRetriesControl::retryLoop<DB::DDLQueryStatusSource::generate()::$_0>(DB::DDLQueryStatusSource::generate()::$_0&&)::'lambda'()>(DB::DDLQueryStatusSource::generate()::$_0&&, void DB::ZooKeeperRetriesControl::retryLoop<DB::DDLQueryStatusSource::generate()::$_0>(DB::DDLQueryStatusSource::generate()::$_0&&)::'lambda'()&&) @ 0x00000000118f5562 in /usr/bin/clickhouse
6. DB::DDLQueryStatusSource::generate() @ 0x00000000118ed029 in /usr/bin/clickhouse
7. DB::ISource::tryGenerate() @ 0x0000000012976655 in /usr/bin/clickhouse
8. DB::ISource::work() @ 0x00000000129760a3 in /usr/bin/clickhouse
9. DB::ExecutionThreadContext::executeTask() @ 0x000000001298f07a in /usr/bin/clickhouse
10. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x0000000012985ad0 in /usr/bin/clickhouse
11. DB::PipelineExecutor::execute(unsigned long, bool) @ 0x0000000012984ce0 in /usr/bin/clickhouse
12. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<true>::ThreadFromGlobalPoolImpl<DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0>(DB::PullingAsyncPipelineExecutor::pull(DB::Chunk&, unsigned long)::$_0&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x0000000012992a03 in /usr/bin/clickhouse
13. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000c8efd7e in /usr/bin/clickhouse
14. start_thread @ 0x0000000000007ea5 in /usr/lib64/libpthread-2.17.so
15. clone @ 0x00000000000feb0d in /usr/lib64/libc-2.17.so
VladislavePolyakov commented 3 months ago

Same story on the latest version, ClickHouse client version 24.2.2.71 (official build): Connecting to localhost:9000 as user default. Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)

VladislavePolyakov commented 3 months ago

Information from Keeper before starting the DDL:

It looks strange that Node count is 13; we actually have only 3 Keeper nodes and 4 database nodes.

echo mntr | nc localhost 9181
zk_version  v24.2.2.71-stable-9293d361e72be9f6ccfd444d504e2137b2e837cf
zk_avg_latency  2
zk_max_latency  16
zk_min_latency  0
zk_packets_received 28
zk_packets_sent 28
zk_num_alive_connections    1
zk_outstanding_requests 0
zk_server_state leader
zk_znode_count  13
zk_watch_count  1
zk_ephemerals_count 0
zk_approximate_data_size    2179
zk_key_arena_size   0
zk_latest_snapshot_size 0
zk_open_file_descriptor_count   75
zk_max_file_descriptor_count    500000
zk_followers    2
zk_synced_followers 2
echo ruok | nc localhost 9181
imok
echo srvr | nc localhost 9181
ClickHouse Keeper version: v24.2.2.71-stable-9293d361e72be9f6ccfd444d504e2137b2e837cf
Latency min/avg/max: 0/2/16
Received: 31
Sent: 31
Connections: 1
Outstanding: 0
Zxid: 0x68
Mode: leader
Node count: 13
echo stat | nc localhost 9181
ClickHouse Keeper version: v24.2.2.71-stable-9293d361e72be9f6ccfd444d504e2137b2e837cf
Clients:
 127.0.0.1:45444(recved=0,sent=0)
 10.189.218.11:45828(recved=33,sent=33)

Latency min/avg/max: 0/2/16
Received: 33
Sent: 33
Connections: 1
Outstanding: 0
Zxid: 0x70
Mode: leader
Node count: 13
echo conf | nc localhost 9181
server_id=1
enable_ipv6=true
tcp_port=9181
four_letter_word_allow_list=jmst,jmfp,jmep,jmdp,conf,cons,crst,envi,ruok,srst,srvr,stat,wchs,dirs,mntr,isro,rcvr,apiv,csnp,lgif,rqld,rclc,clrs,ftfl,ydld,pfev
max_requests_batch_size=100
min_session_timeout_ms=10000
session_timeout_ms=30000
operation_timeout_ms=10000
dead_session_check_period_ms=500
heart_beat_interval_ms=500
election_timeout_lower_bound_ms=1000
election_timeout_upper_bound_ms=2000
reserved_log_items=100000
snapshot_distance=100000
auto_forwarding=true
shutdown_timeout=5000
startup_timeout=180000
raft_logs_level=warning
snapshots_to_keep=3
rotate_log_storage_interval=100000
stale_log_gap=10000
fresh_log_gap=200
max_requests_batch_size=100
max_requests_batch_bytes_size=102400
max_flush_batch_size=1000
max_request_queue_size=100000
max_requests_quick_batch_size=100
quorum_reads=false
force_sync=true
compress_logs=false
compress_snapshots_with_zstd_format=true
configuration_change_tries_count=20
raft_limits_reconnect_limit=50
async_replication=false
log_storage_path=/var/lib/clickhouse/coordination/log
log_storage_disk=LocalLogDisk
snapshot_storage_path=/var/lib/clickhouse/coordination/snapshots
snapshot_storage_disk=LocalSnapshotDisk
echo cons | nc localhost 9181
 127.0.0.1:45452(recved=0,sent=0,sid=0xffffffffffffffff,lop=NA,est=1711032588305,to=0,lzxid=0xffffffffffffffff,lresp=0,llat=0,minlat=0,avglat=0,maxlat=0)
 10.189.218.11:45828(recved=37,sent=37,sid=0x0000000000000002,lop=List,est=1711032319539,to=30000,lcxid=0x000000000000000b,lzxid=0x0000000000000010,lresp=1711032319581,llat=0,minlat=0,avglat=2,maxlat=16)
echo envi | nc localhost 9181
Environment:
clickhouse.keeper.version=v24.2.2.71-stable-9293d361e72be9f6ccfd444d504e2137b2e837cf
host.name=ocs-clickhouse01.devel.pro
os.name=Linux
os.arch=x86_64
os.version=3.10.0-1160.105.1.el7.x86_64
cpu.count=6
user.name=
user.home=/home/clickhouse/
user.dir=/
user.tmp=/tmp/
echo isro | nc localhost 9181
rw
echo wchs | nc localhost 9181
1 connections watching 1 paths
Total watches:1
al13n321 commented 3 months ago

Sounds like the other server crashed when trying to create the table. There's a known problem that would cause that, fixed in https://github.com/ClickHouse/ClickHouse/pull/58934 (the change in ASTCreateQuery::clone()).

It looks strange that Node count is 13; we actually have only 3 Keeper nodes and 4 database nodes.

The first output says zk_znode_count - that's the number of znodes ("files"/"directories" in ZooKeeper), not the number of servers. I'm guessing "Node count" in the other two outputs means znodes as well.
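
To see what those 13 "nodes" actually are, they can be listed directly from ClickHouse; a minimal sketch, assuming the server's zookeeper/keeper config points at this Keeper instance (the system.zookeeper table requires a condition on path):

-- List top-level znodes; zk_znode_count counts stored paths like these
-- (e.g. the DDL queue under /clickhouse), not cluster members.
SELECT name, numChildren
FROM system.zookeeper
WHERE path = '/'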

VladislavePolyakov commented 3 months ago

@al13n321 Michael, thank you for the answer. Do you have a built version? I would like to test this fix in our environment.

al13n321 commented 3 months ago

Build: https://s3.amazonaws.com/clickhouse-test-reports/58934/acccccbaf41e02d7e2d42662cb83d9799924b936/clickhouse_build_check/report.html (from "ClickHouse build check" in the list of checks in the PR).
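
If useful, after installing that build you can confirm the server is actually running it before retrying the DDL; a minimal check:

-- Show the running server version to confirm the test build is in use.
SELECT version()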

Hubbitus commented 3 months ago

@al13n321, is it available as a Docker container? When is this planned to be released?

VladislavePolyakov commented 3 months ago

Build: https://s3.amazonaws.com/clickhouse-test-reports/58934/acccccbaf41e02d7e2d42662cb83d9799924b936/clickhouse_build_check/report.html (from "ClickHouse build check" in the list of checks in the PR).

Tested, and it's working. Thanks!

VladislavePolyakov commented 3 months ago

Tested the MV with REFRESH in more detail: all migrations applied successfully, but the refresh itself failed with an error.

SELECT *
FROM system.view_refreshes

Query id: 53bca73b-3f93-4c9c-8916-e13c3f2e5998

Row 1:
──────
database:               default
view:                   applications_mv
status:                 Scheduled
last_refresh_result:    Error
last_refresh_time:      2024-04-01 16:40:08
last_success_time:      ᴺᵁᴸᴸ
duration_ms:            8105
next_refresh_time:      2024-04-01 16:50:00
remaining_dependencies: []
exception:              Code: 48. DB::Exception: RENAME EXCHANGE is not supported. (NOT_IMPLEMENTED), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000cc4f3db
1. DB::Exception::Exception<>(int, FormatStringHelperImpl<>) @ 0x000000000762ac63
2. DB::DatabaseAtomic::renameTable(std::shared_ptr<DB::Context const>, String const&, DB::IDatabase&, String const&, bool, bool) @ 0x00000000101af8c8
3. DB::InterpreterRenameQuery::execute() @ 0x0000000010dea198
4. DB::StorageMaterializedView::exchangeTargetTable(DB::StorageID, std::shared_ptr<DB::Context const>) @ 0x000000001177497d
5. DB::RefreshTask::refreshTask() @ 0x00000000121707ce
6. DB::BackgroundSchedulePool::threadFunction() @ 0x000000000fdbfc2a
7. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<DB::BackgroundSchedulePool::BackgroundSchedulePool(unsigned long, StrongTypedef<unsigned long, CurrentMetrics::MetricTag>, StrongTypedef<unsigned long, CurrentMetrics::MetricTag>, char const*)::$_0>(DB::BackgroundSchedulePool::BackgroundSchedulePool(unsigned long, StrongTypedef<unsigned long, CurrentMetrics::MetricTag>, StrongTypedef<unsigned long, CurrentMetrics::MetricTag>, char const*)::$_0&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000000fdc0c33
8. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000ccf782d
9. start_thread @ 0x0000000000007ea5
10. clone @ 0x00000000000feb0d
 (version 24.3.1.2153)
retry:                  0
refresh_count:          0
progress:               inf
elapsed:                0
read_rows:              469472
read_bytes:             37158348
total_rows:             0
total_bytes:            0
written_rows:           0
written_bytes:          0
result_rows:            0
result_bytes:           0

@al13n321 Could you recommend anything regarding this error?

al13n321 commented 1 month ago

"RENAME EXCHANGE is not supported" means the Linux kernel is older than 3.15 (or Mac OS is older than 10.12), required for refreshable MV.