
A problem with my Kubernetes: ClickHouse does not start #71557

[Open] karthik-thiyagarajan opened this issue 3 weeks ago

karthik-thiyagarajan commented 3 weeks ago

We are running a 3-shard, 3-replica ClickHouse setup in a Kubernetes cluster with 3 ZooKeeper replica pods. We tried to add a column that already exists, then dropped the same column, and this caused all the pods to crash. We are on ClickHouse version 24.6 and are not using any experimental features.

I've obfuscated the database and column names for obvious reasons. I'm unable to bring up any ClickHouse pods now; please help me recover from this failure.
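For reference, the statements were of this shape (names obfuscated; the DROP COLUMN appears verbatim in the crash log below, while the column type here is just a placeholder):

  -- the add that failed because the column already existed (String is a placeholder type)
  ALTER TABLE dbname.stats_table_name ADD COLUMN column_name1 String;
  -- the subsequent drop, as captured in the DDL queue entry in the log
  ALTER TABLE dbname.stats_table_name DROP COLUMN IF EXISTS column_name1;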

: While sending /var/lib/clickhouse/store/9cb/9cbafb32-5bc1-4a88-9146-9b705f93ffe9/shard2_all_replicas/1.bin. (ALL_CONNECTION_TRIES_FAILED), Stack trace (when copying this message, always include the lines below):

  1. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000d02cb1b
  2. DB::NetException::NetException<String&>(int, FormatStringHelperImpl<std::type_identity<String&>::type>, String&) @ 0x0000000012226523
  3. PoolWithFailoverBase::getMany(unsigned long, unsigned long, unsigned long, unsigned long, bool, bool, std::function<PoolWithFailoverBase::TryResult (std::shared_ptr const&, String&)> const&, std::function<Priority (unsigned long)> const&) @ 0x00000000129f6d1e
  4. DB::ConnectionPoolWithFailover::getManyImpl(DB::Settings const&, DB::PoolMode, std::function<PoolWithFailoverBase::TryResult (std::shared_ptr const&, String&)> const&, std::optional, std::function<Priority (unsigned long)>, bool) @ 0x00000000129f52a7
  5. DB::ConnectionPoolWithFailover::getManyCheckedForInsert(DB::ConnectionTimeouts const&, DB::Settings const&, DB::PoolMode, DB::QualifiedTableName const&) @ 0x00000000129f56aa
  6. DB::DistributedAsyncInsertDirectoryQueue::processFile(String&, DB::SettingsChanges const&) @ 0x00000000123de090
  7. DB::DistributedAsyncInsertDirectoryQueue::processFiles(DB::SettingsChanges const&) @ 0x00000000123d4113
  8. void std::function::policy_invoker<void ()>::call_impl<std::function::default_alloc_func<DB::DistributedAsyncInsertDirectoryQueue::DistributedAsyncInsertDirectoryQueue(DB::StorageDistributed&, std::shared_ptr const&, String const&, std::shared_ptr, DB::ActionBlocker&, DB::BackgroundSchedulePool&)::$_0, void ()>>(std::function::__policy_storage const*) @ 0x00000000123e1d55
  9. DB::BackgroundSchedulePool::threadFunction() @ 0x0000000010661860
  10. void std::function::policy_invoker<void ()>::__call_impl<std::function::default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<DB::BackgroundSchedulePool::BackgroundSchedulePool(unsigned long, StrongTypedef<unsigned long, CurrentMetrics::MetricTag>, StrongTypedef<unsigned long, CurrentMetrics::MetricTag>, char const)::$_0>(DB::BackgroundSchedulePool::BackgroundSchedulePool(unsigned long, StrongTypedef<unsigned long, CurrentMetrics::MetricTag>, StrongTypedef<unsigned long, CurrentMetrics::MetricTag>, char const)::$_0&&)::'lambda'(), void ()>>(std::function::policy_storage const*) @ 0x0000000010662907
  11. void std::thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::thread_struct, std::default_delete>, void ThreadPoolImpl::scheduleImpl(std::function<void ()>, Priority, std::optional, bool)::'lambda0'()>>(void) @ 0x000000000d0e55e3
  12. ? @ 0x00007f35c7ac6609
  13. ? @ 0x00007f35c79e1353 (version 24.6.3.95 (official build))

2024.11.07 06:05:47.383082 [ 786 ] {} BaseDaemon: ########## Short fault info ############
2024.11.07 06:05:47.383140 [ 786 ] {} BaseDaemon: (version 24.6.3.95 (official build), build id: 1C46AC978AF543A2DB896339CF1F654E731358FF, git hash: 8325c920d1102360c91b756da2eb32ae832fae8d) (from thread 707) Received signal 11
2024.11.07 06:05:47.383159 [ 786 ] {} BaseDaemon: Signal description: Segmentation fault
2024.11.07 06:05:47.383173 [ 786 ] {} BaseDaemon: Address: NULL pointer. Access: read. Address not mapped to object.
2024.11.07 06:05:47.383193 [ 786 ] {} BaseDaemon: Stack trace: 0x000000000d2d34ec 0x00007f35c7ad2420 0x000000001148480c 0x0000000011efa338 0x000000001166d857 0x000000001165f723 0x00000000116537e2 0x000000001164fd8a 0x0000000011eb3dd6 0x0000000011f07630 0x000000001139f4dd 0x000000001139cbed 0x0000000011a8e123 0x0000000011a930be 0x0000000010de5e62 0x0000000010de41a8 0x0000000010de1070 0x0000000010dd9b4e 0x0000000010df3342 0x000000000d0e55e3 0x00007f35c7ac6609 0x00007f35c79e1353
2024.11.07 06:05:47.383210 [ 786 ] {} BaseDaemon: ########################################
2024.11.07 06:05:47.383224 [ 786 ] {} BaseDaemon: (version 24.6.3.95 (official build), build id: 1C46AC978AF543A2DB896339CF1F654E731358FF, git hash: 8325c920d1102360c91b756da2eb32ae832fae8d) (from thread 707) (query_id: 42c2b1b7-4656-4b89-8819-afa98dfba2ec) (query: /* ddl_entry=query-0000007376 */ ALTER TABLE dbname.stats_table_name DROP COLUMN IF EXISTS column_name1) Received signal Segmentation fault (11)
2024.11.07 06:05:47.383240 [ 786 ] {} BaseDaemon: Address: NULL pointer. Access: read. Address not mapped to object.
2024.11.07 06:05:47.383256 [ 786 ] {} BaseDaemon: Stack trace: 0x000000000d2d34ec 0x00007f35c7ad2420 0x000000001148480c 0x0000000011efa338 0x000000001166d857 0x000000001165f723 0x00000000116537e2 0x000000001164fd8a 0x0000000011eb3dd6 0x0000000011f07630 0x000000001139f4dd 0x000000001139cbed 0x0000000011a8e123 0x0000000011a930be 0x0000000010de5e62 0x0000000010de41a8 0x0000000010de1070 0x0000000010dd9b4e 0x0000000010df3342 0x000000000d0e55e3 0x00007f35c7ac6609 0x00007f35c79e1353
2024.11.07 06:05:47.383317 [ 786 ] {} BaseDaemon: 0. signalHandler(int, siginfo_t*, void*) @ 0x000000000d2d34ec
2024.11.07 06:05:47.383331 [ 786 ] {} BaseDaemon: 1. ? @ 0x00007f35c7ad2420
2024.11.07 06:05:47.383357 [ 786 ] {} BaseDaemon: 2. _Z11typeid_castIRKN2DB9QueryNodeENS0_14IQueryTreeNodeEQsr3stdE14is_reference_vIT_EES5RT0 @ 0x000000001148480c
2024.11.07 06:05:47.383387 [ 786 ] {} BaseDaemon: 3. DB::StorageDistributed::getQueryProcessingStage(std::shared_ptr, DB::QueryProcessingStage::Enum, std::shared_ptr const&, DB::SelectQueryInfo&) const @ 0x0000000011efa338
2024.11.07 06:05:47.383409 [ 786 ] {} BaseDaemon: 4. DB::InterpreterSelectQuery::getSampleBlockImpl() @ 0x000000001166d857
2024.11.07 06:05:47.383438 [ 786 ] {} BaseDaemon: 5. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::shared_ptr const&, std::shared_ptr const&, std::optional, std::shared_ptr const&, DB::SelectQueryOptions const&, std::vector<String, std::allocator> const&, std::shared_ptr const&, std::shared_ptr)::$_0::operator()(bool) const @ 0x000000001165f723
2024.11.07 06:05:47.383463 [ 786 ] {} BaseDaemon: 6. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::shared_ptr const&, std::shared_ptr const&, std::optional, std::shared_ptr const&, DB::SelectQueryOptions const&, std::vector<String, std::allocator> const&, std::shared_ptr const&, std::shared_ptr) @ 0x00000000116537e2
2024.11.07 06:05:47.383482 [ 786 ] {} BaseDaemon: 7. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::shared_ptr const&, std::shared_ptr const&, std::optional, std::shared_ptr const&, DB::SelectQueryOptions const&, std::vector<String, std::allocator> const&, std::shared_ptr const&, std::shared_ptr) @ 0x000000001164fd8a
2024.11.07 06:05:47.383503 [ 786 ] {} BaseDaemon: 8. DB::IStorage::getDependentViewsByColumn(std::shared_ptr) const @ 0x0000000011eb3dd6
2024.11.07 06:05:47.383518 [ 786 ] {} BaseDaemon: 9. DB::StorageDistributed::checkAlterIsPossible(DB::AlterCommands const&, std::shared_ptr) const @ 0x0000000011f07630
2024.11.07 06:05:47.383539 [ 786 ] {} BaseDaemon: 10. DB::InterpreterAlterQuery::executeToTable(DB::ASTAlterQuery const&) @ 0x000000001139f4dd
2024.11.07 06:05:47.383558 [ 786 ] {} BaseDaemon: 11. DB::InterpreterAlterQuery::execute() @ 0x000000001139cbed
2024.11.07 06:05:47.383579 [ 786 ] {} BaseDaemon: 12. DB::executeQueryImpl(char const*, char const*, std::shared_ptr, DB::QueryFlags, DB::QueryProcessingStage::Enum, DB::ReadBuffer*) @ 0x0000000011a8e123
2024.11.07 06:05:47.383604 [ 786 ] {} BaseDaemon: 13. DB::executeQuery(DB::ReadBuffer&, DB::WriteBuffer&, bool, std::shared_ptr, std::function<void (DB::QueryResultDetails const&)>, DB::QueryFlags, std::optional const&, std::function<void (DB::IOutputFormat&, String const&, std::shared_ptr const&, std::optional const&)>) @ 0x0000000011a930be
2024.11.07 06:05:47.383625 [ 786 ] {} BaseDaemon: 14. DB::DDLWorker::tryExecuteQuery(DB::DDLTaskBase&, std::shared_ptr const&) @ 0x0000000010de5e62
2024.11.07 06:05:47.383644 [ 786 ] {} BaseDaemon: 15. DB::DDLWorker::processTask(DB::DDLTaskBase&, std::shared_ptr const&) @ 0x0000000010de41a8
2024.11.07 06:05:47.383659 [ 786 ] {} BaseDaemon: 16. DB::DDLWorker::scheduleTasks(bool) @ 0x0000000010de1070
2024.11.07 06:05:47.383678 [ 786 ] {} BaseDaemon: 17. DB::DDLWorker::runMainThread() @ 0x0000000010dd9b4e
2024.11.07 06:05:47.383700 [ 786 ] {} BaseDaemon: 18. void std::function::policy_invoker<void ()>::__call_impl<std::function::default_alloc_func<ThreadFromGlobalPoolImpl<true, true>::ThreadFromGlobalPoolImpl<void (DB::DDLWorker::)(), DB::DDLWorker>(void (DB::DDLWorker::&&)(), DB::DDLWorker&&)::'lambda'(), void ()>>(std::function::policy_storage const) @ 0x0000000010df3342
2024.11.07 06:05:47.383728 [ 786 ] {} BaseDaemon: 19. void std::thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::thread_struct, std::default_delete>, void ThreadPoolImpl::scheduleImpl(std::function<void ()>, Priority, std::optional, bool)::'lambda0'()>>(void) @ 0x000000000d0e55e3
2024.11.07 06:05:47.383739 [ 786 ] {} BaseDaemon: 20. ? @ 0x00007f35c7ac6609
2024.11.07 06:05:47.383749 [ 786 ] {} BaseDaemon: 21. ? @ 0x00007f35c79e1353
2024.11.07 06:05:47.528329 [ 786 ] {} BaseDaemon: Integrity check of the executable successfully passed (checksum: C1D073B23162165ABDAAD18CE5500AD7)
2024.11.07 06:05:47.528557 [ 786 ] {} BaseDaemon: Report this error to https://github.com/ClickHouse/ClickHouse/issues
2024.11.07 06:05:47.528689 [ 786 ] {} BaseDaemon: Changed settings: connect_timeout_with_failover_ms = 1000, distributed_aggregation_memory_efficient = true, log_queries = true, parallel_view_processing = true, max_partitions_per_insert_block = 5000, default_database_engine = 'Ordinary'
2024.11.07 06:05:48.598429 [ 587 ] {} dbname.alerts_script_version_info.DistributedInsertQueue.default: Code: 279. DB::NetException: All connection tries failed. Log:

=========

karthik-thiyagarajan commented 3 weeks ago

For now, I've deleted the query from the DDL task queue and am now able to bring up the cluster. But I'm still not sure why a single DROP COLUMN statement brings the entire server down.
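For anyone hitting the same crash loop: the stuck entry lives under the distributed DDL queue in ZooKeeper (assuming the default distributed_ddl path /clickhouse/task_queue/ddl; in our case the entry was query-0000007376, visible in the crash log). I removed that znode with the ZooKeeper CLI (e.g. `deleteall /clickhouse/task_queue/ddl/query-0000007376` in zkCli.sh) while the pods were down. Once a node is back up, the queue can be inspected from SQL:

  -- list the pending distributed DDL entries (default task queue path assumed)
  SELECT name, value
  FROM system.zookeeper
  WHERE path = '/clickhouse/task_queue/ddl';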

den-crane commented 3 weeks ago

Do you have a VIEW or a MATERIALIZED VIEW that queries this table?

karthik-thiyagarajan commented 3 weeks ago

Yes, we have one materialized view accessing this table.

den-crane commented 3 weeks ago

Could you please share the CREATE TABLE, the CREATE MATERIALIZED VIEW, the ALTER TABLE ... ADD COLUMN, and the ALTER TABLE ... DROP COLUMN statements?

karthik-thiyagarajan commented 3 weeks ago

============================================================================================

CREATE MATERIALIZED VIEW av.mv_stats_aggregated TO av.stats_aggregated
(
    datapoint_timestamp DateTime,
    tid String,
    sid String,
    bkd Bool,
    ail LowCardinality(String),
    anl LowCardinality(String),
    ac String,
    an String,
    wc String,
    wn String,
    are Enum8('NA' = 0, 'TW' = 1, 'L' = 2, 'M' = 3, 'S' = 4, 'H' = 5),
    ahe String,
    stb UInt64,
    srb UInt64,
    all AggregateFunction(avg, UInt32),
    awl AggregateFunction(avg, UInt32),
    aq AggregateFunction(avg, UInt8),
    ucm AggregateFunction(uniq, String)
) AS
SELECT
    toStartOfHour(timestamp) AS datapoint_timestamp,
    tid,
    sid,
    bkd,
    ail,
    anl,
    ac,
    acn,
    wc,
    wn,
    are,
    multiIf((usl = 1) AND (utl = 1), 'usatl', (usl = 1) AND (utl = 0), 'usl', (usl = 0) AND (utl = 1), 'utl', 'undefined') AS ahe,
    sum(tb) AS stb,
    sum(rb) AS srb,
    avgState(sart) AS all,
    avgState(cart) AS awl,
    avgState(qoe) AS aq,
    uniqState(cm) AS ucm
FROM av.av_stats_all_v3
WHERE ((bkd = true) OR (rb > 0) OR (tb > 0)) AND ((usl = 1) OR (utl = 1)) AND (rb < 1125899906842624) AND (tb < 1125899906842624)
GROUP BY
    toStartOfHour(timestamp),
    tid, sid, bkd, ail, anl, ac, acn, wc, wn, are, ahe

============================================================================================

CREATE TABLE av.av_stats_all_v3
(
    timestamp DateTime64(3),
    s String,
    tid String,
    sid String,
    cm String,
    si String,
    sp String,
    di String,
    dp String,
    ac String,
    wc String,
    wup String,
    ur String,
    sn String,
    bkd Bool,
    pf Bool,
    attgf Bool,
    lp UInt64,
    rp UInt64,
    rb UInt64,
    tb UInt64,
    rtm DateTime64(3),
    sdm String,
    dtm DateTime64(3),
    dm String,
    bid String,
    un String,
    vi UInt64,
    at UInt32,
    as Bool,
    vhs String,
    vhd String,
    vts Array(String),
    vtd Array(String),
    cmd String,
    acn String,
    drr String,
    rdu String,
    wcn String,
    usl UInt8 DEFAULT 1,
    utl UInt8 DEFAULT 1,
    it UInt32,
    vp UInt32,
    st String,
    dt String,
    vis String,
    vid String,
    vns String,
    vnd String,
    car UInt32,
    sar UInt32,
    tllp UInt32,
    twlp UInt32,
    qoe UInt8,
    cn String,
    si4 String,
    si6 String,
    dp String,
    vn String,
    dn String,
    ced String,
    tsv String,
    adht Enum8('ADHTU' = 0, 'ADHTPB' = 1, 'ADHTPT' = 2) DEFAULT 0,
    prt UInt8,
    fhdn String,
    fhds String,
    are Enum8('Not_Evaluated' = 0, 'TW' = 1, 'L' = 2, 'M' = 3, 'S' = 4, 'H' = 5) DEFAULT 0,
    wcnl LowCardinality(String),
    casl LowCardinality(String),
    cadl LowCardinality(String),
    mte Enum8('UNKNOWN' = 0, 'GPMT' = 1, 'APMT' = 2, 'AEMT' = 3) DEFAULT 0,
    sccte Enum8('CTU' = 0, 'CTW' = 1, 'CTW' = 2) DEFAULT 0,
    ail LowCardinality(String),
    acl LowCardinality(String),
    wcl LowCardinality(String),
    onl LowCardinality(String),
    mvl LowCardinality(String),
    cl LowCardinality(String),
    col LowCardinality(String),
    cdail LowCardinality(String),
    anl LowCardinality(String),
    acnl LowCardinality(String),
    dvte Enum8('DTU' = 0, 'DTAP' = 1, 'DTS' = 2, 'DTG' = 3) DEFAULT 0,
    dpte Enum8('DTU' = 0, 'DTI' = 1, 'DTIVC' = 2, 'DTC2C' = 3) DEFAULT 0,
    re Enum8('Unknown' = 0, 'TW' = 1, 'L' = 2, 'M' = 3, 'S' = 4, 'H' = 5) DEFAULT 0,
    ccl LowCardinality(FixedString(2)),
    cfl LowCardinality(String),
    cart UInt32,
    sart UInt32,
    sn String,
    fhdsi String
)
ENGINE = Distributed('{cluster}', 'av', 'av_local', rand())

============================================================================================

I have obfuscated the column names. We also have a local table with the same spec as the Distributed table given above; it's a ReplicatedMergeTree table. Let me know if this helps.
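For illustration, the local table looks roughly like this (the ORDER BY and ZooKeeper path shown here are placeholders rather than our real values; the column list is identical to the Distributed table above):

  CREATE TABLE av.av_local
  (
      timestamp DateTime64(3),
      tid String,
      sid String
      -- ... remaining columns identical to av.av_stats_all_v3 above
  )
  ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/av/av_local', '{replica}')
  ORDER BY (tid, sid, timestamp)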

============================================================================================

den-crane commented 3 weeks ago

"We tried to add a column that already exists and dropped the same column"

So you ran something like `ALTER TABLE av.av_stats_all_v3 ADD COLUMN s String`, it failed, and then you executed `ALTER TABLE av.av_stats_all_v3 DROP COLUMN s`? Or did you alter the local table?

karthik-thiyagarajan commented 3 weeks ago

I have a local table and a Distributed table, plus a materialized view on top of them, as shared above. I ran the DROP COLUMN on the local table first and it dropped the column successfully. Then I ran the same DROP COLUMN on the Distributed table, and that's when it didn't drop the column; instead the pods went into CrashLoopBackOff with the error above. The exact sequence is sketched below.
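Spelling the sequence out (table and column names as obfuscated earlier; whether the local ALTER also went through ON CLUSTER is an assumption on my part, though the crashing statement clearly did, given the ddl_entry in the log):

  -- step 1: drop on the local ReplicatedMergeTree table; this succeeded
  ALTER TABLE av.av_local ON CLUSTER '{cluster}' DROP COLUMN IF EXISTS column_name1;
  -- step 2: the same drop on the Distributed table; this is where the pods started crashing
  ALTER TABLE av.av_stats_all_v3 ON CLUSTER '{cluster}' DROP COLUMN IF EXISTS column_name1;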

den-crane commented 3 weeks ago

@karthik-thiyagarajan you have a duplicate value in the enum: 'CTW' = 1, 'CTW' = 2. Is it a typo from the obfuscation?

karthik-thiyagarajan commented 3 weeks ago

Yes, sorry for that. There could have been some typos while obfuscating the column names. Let me know if you want me to fix it and provide a clean DDL.

den-crane commented 2 weeks ago

I tried to reproduce it with your schema, but I failed. Everything works for me.

karthik-thiyagarajan commented 2 weeks ago

Can you confirm that you followed the steps below and it still works?

  1. Create the local table.
  2. Create the Distributed table (with col1, col2 ... col10).
  3. Create the materialized view with my script.
  4. Start a parallel process that keeps inserting records into the Distributed table at a constant pace.
  5. While records are being inserted, drop the column on the local table and then on the Distributed table, and see if it still works (a condensed sketch follows this list).

This is the scenario in which we saw the failure.
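A condensed sketch of steps 4 and 5 against the tables shared earlier (using `s` as the example column, as in the earlier comment; the insert column list is just a minimal subset of the schema):

  -- step 4: a writer process keeps issuing inserts like this against the Distributed table
  INSERT INTO av.av_stats_all_v3 (timestamp, tid, sid, rb, tb) VALUES (now64(3), 't1', 's1', 10, 20);

  -- step 5: while those inserts are in flight, drop a column on the local table,
  -- then on the Distributed table
  ALTER TABLE av.av_local DROP COLUMN IF EXISTS s;
  ALTER TABLE av.av_stats_all_v3 DROP COLUMN IF EXISTS s;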