StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0

CN repeatedly crash in shared-data mode after upgrading to v3.1.9 #47316

Open yeyudefeng opened 1 week ago

yeyudefeng commented 1 week ago

StarRocks deployed on Kubernetes: 3 FE + 3 CN in shared-data mode, on Tencent EKS (Kubernetes) with COS object storage, and CFS mounted for logs and the CN cache.

We initially deployed 3.1.7 and later upgraded to 3.1.8, repeatedly verifying upgrade and rollback in between. We then migrated data from another cluster on 3.1.7 into this 3.1.8 cluster using the official SMT migration tool.

The official tool hit a corner case at the time; someone from the community helped fix it and rebuilt the SMT package, and the migration completed.

Later, versions below 3.1.9 had a bug that was fixed in 3.1.9, so we upgraded to 3.1.11. Four of our offline data-warehouse tasks then failed consistently with the errors below, while the other tasks ran normally. We repeatedly upgraded and rolled back across 3.1.9, 3.1.11, and 3.1.12, and the failure reproduced reliably on all three. Rolling back to 3.1.7 or 3.1.8 in between restored normal operation.

On 3.1.11 we also tried dropping every table involved in the failing tasks and re-importing the data; the problem persisted.

Note: during a Kubernetes FE upgrade, we cannot guarantee that the leader FE is the last one to be restarted.

java.lang.RuntimeException: Internal error. Detail: deserialize chunk data failed. column slot id: 86, column row count: 18446744073709551615, expected row count: 200. There is probably a bug here.

java.lang.RuntimeException: Runtime error: encode size does not equal when decoding, encode size = 4294967296, but decode get size = 1662, raw size = 2444.
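The two sizes in these messages are themselves telling: 18446744073709551615 is 2^64 - 1, i.e. an unsigned 64-bit value that wrapped to -1, and 4294967296 is exactly 2^32, a 64-bit value whose low 32 bits are all zero. A minimal sketch (illustrative C++ only, not StarRocks source) of how such values arise from wrapped or partially-zeroed unsigned integers:

```cpp
// Illustrative only (not StarRocks source): the constants in the two errors
// are the classic fingerprints of wrapped or partially-corrupted unsigned ints.
#include <cstdint>
#include <iostream>

int main() {
    uint64_t row_count = 0;
    row_count -= 1;                    // unsigned wrap-around
    std::cout << row_count << '\n';    // 18446744073709551615 == 2^64 - 1

    uint64_t encode_size = 1ULL << 32; // low 32 bits all zero
    std::cout << encode_size << '\n';  // 4294967296, as in the second error
}
```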

Whenever the exceptions above occur, the CN crashes and the service restarts intermittently. It appears fine as long as we do not upgrade, so we are not sure whether the problem is caused by the upgrade itself.

CN log:

```
start time: Wed Jun 19 08:48:14 CST 2024
3.1.11 RELEASE (build 34f131b)
query_id:2f13746c-2dd6-11ef-893b-525400e32b8f, fragment_instance:2f13746c-2dd6-11ef-893b-525400e32ba9
tracker:process consumption: 5996113600
tracker:query_pool consumption: 3761682504
tracker:load consumption: 0
tracker:metadata consumption: 6561182
tracker:tablet_metadata consumption: 65995
tracker:rowset_metadata consumption: 0
tracker:segment_metadata consumption: 1017435
tracker:column_metadata consumption: 5477752
tracker:tablet_schema consumption: 65995
tracker:segment_zonemap consumption: 520019
tracker:short_key_index consumption: 145788
tracker:column_zonemap_index consumption: 1554712
tracker:ordinal_index consumption: 1515816
tracker:bitmap_index consumption: 0
tracker:bloom_filter_index consumption: 0
tracker:compaction consumption: 0
tracker:schema_change consumption: 0
tracker:column_pool consumption: 0
tracker:page_cache consumption: 1870804096
tracker:update consumption: 68370
tracker:chunk_allocator consumption: 69725272
tracker:clone consumption: 0
tracker:consistency consumption: 0
tracker:datacache consumption: 0
tracker:replication consumption: 0
Aborted at 1718758341 (unix time) try "date -d @1718758341" if you are using GNU date
PC: @ 0x8557ace svb_decode_avx_simple
SIGSEGV (@0x7fc28176c000) received by PID 27 (TID 0x7fc2f326a640) from PID 18446744071586627584; stack trace:
@ 0x7cfae6a google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fc358cfe520 (unknown)
@ 0x8557ace svb_decode_avx_simple
@ 0x8557d21 streamvbyte_decode
@ 0x55619ff starrocks::serde::(anonymous namespace)::decode_integers<>()
@ 0x5563a5b starrocks::ColumnVisitorMutableAdapter<>::visit()
@ 0x3d1001d starrocks::ColumnFactory<>::accept_mutable()
@ 0x55669ea starrocks::serde::ColumnArraySerde::deserialize()
@ 0x5566ccf starrocks::ColumnVisitorMutableAdapter<>::visit()
@ 0x3d638dd starrocks::ColumnFactory<>::accept_mutable()
@ 0x55669ea starrocks::serde::ColumnArraySerde::deserialize()
@ 0x5566c83 starrocks::ColumnVisitorMutableAdapter<>::visit()
@ 0x3d6561d starrocks::ColumnFactory<>::accept_mutable()
@ 0x55669ea starrocks::serde::ColumnArraySerde::deserialize()
@ 0x65620ba starrocks::serde::ProtobufChunkDeserializer::deserialize()
@ 0x662074b starrocks::DataStreamRecvr::SenderQueue::_deserialize_chunk()
@ 0x66249c7 starrocks::DataStreamRecvr::PipelineSenderQueue::get_chunk()
@ 0x66179d6 starrocks::DataStreamRecvr::get_chunk_for_pipeline()
@ 0x60a9493 starrocks::pipeline::ExchangeSourceOperator::pull_chunk()
@ 0x5c7e4d0 starrocks::pipeline::PipelineDriver::process()
@ 0x6508830 starrocks::pipeline::GlobalDriverExecutor::_worker_thread()
@ 0x68ac92c starrocks::ThreadPool::dispatch_thread()
@ 0x68a5d6a starrocks::thread::supervise_thread()
@ 0x7fc358d50ac3 (unknown)
@ 0x7fc358de2850 (unknown)
@ 0x0 (unknown)
```
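The crash site is consistent: the receive side of a shuffle (ExchangeSourceOperator::pull_chunk → DataStreamRecvr → ProtobufChunkDeserializer), faulting inside the StreamVByte integer decoder (svb_decode_avx_simple). A decoder in that position has to trust the element count handed to it; the toy sketch below (assumed illustrative code, not the actual StarRocks or streamvbyte source) shows how a corrupted count becomes an out-of-bounds read, and a SIGSEGV once the read crosses into an unmapped page:

```cpp
// Toy stand-in (not StarRocks or streamvbyte source) for a decoder that
// trusts a count deserialized from the wire.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

static void toy_decode(const uint8_t* in, size_t in_len, uint32_t* out, uint64_t count) {
    size_t pos = 0;
    for (uint64_t i = 0; i < count; ++i) {
        if (pos + 4 > in_len) {
            // A real SIMD kernel such as svb_decode_avx_simple omits this
            // bounds check for speed; given a corrupted count it keeps
            // loading past the buffer and faults at an unmapped page.
            std::cerr << "would read out of bounds at element " << i << '\n';
            return;
        }
        std::memcpy(&out[i], in + pos, 4);
        pos += 4;
    }
}

int main() {
    std::vector<uint8_t> wire(800);         // room for 200 values, as in the error
    std::vector<uint32_t> out(200);
    uint64_t corrupted_count = UINT64_MAX;  // the row count from the report
    toy_decode(wire.data(), wire.size(), out.data(), corrupted_count);
}
```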

SQL (smart quotes restored to plain quotes; the `select *` asterisks inside the union, which the issue renderer swallowed as italics, are restored):

```sql
insert overwrite db_elderly_care_ads.ads_clique_operation_client_dap partition(part_date='2024')
with ac_client as (
    select t.org_id
         , 1 public_client_cmt
         , case when charger_id != '' or sharer_id != '' then 1 else 0 end client_cmt
         , case when user_id != '' then 1 else 0 end user_cmt
         , null order_cmt
         , null finish_order_cmt
    from db_elderly_care_ods.ods_tb_client_info_dap t
    where t.delete_status = 0
      and date(t.create_time) < date_trunc('year', add_months('2024-06-19', 12))
),
ac_order as (
    select t.org_id
         , 0 public_client_cmt
         , 0 client_cmt
         , 0 user_cmt
         , t.client_id order_cmt
         , case when o.order_status = 5 then t.client_id else null end finish_order_cmt
    from db_elderly_care_ods.ods_tb_client_info_dap t
    left join db_elderly_care_ods.ods_tb_order_dap as o on t.user_id = o.customer_id
    where t.delete_status = 0
      and o.delete_status = 0
      and o.order_type = 0
      and date(t.create_time) < date_trunc('year', add_months('2024-06-19', 12))
),
se_vi as (
    select one_level, two_level, three_level, four_level, five_level
         , public_client_cmt, client_cmt, user_cmt, order_cmt, finish_order_cmt
    from (
        select * from ac_client
        union all
        select * from ac_order
    ) s
    join db_elderly_care_dim.dim_org_level_dap e
      on s.org_id = e.org_id and org_type_id in (3, 4, 5, 6, 7, 8, 9)
),
ac_re as (
    select one_level, two_level, three_level, four_level, five_level
         , sum(public_client_cmt) public_client_cmt
         , sum(client_cmt) client_cmt
         , sum(user_cmt) user_cmt
         , count(distinct order_cmt) order_cmt
         , count(distinct finish_order_cmt) finish_order_cmt
    from se_vi
    group by grouping sets ((one_level), (two_level), (three_level), (four_level), (five_level))
),
ac_bb as (
    select case when one_level != '' then one_level
                when two_level != '' then two_level
                when three_level != '' then three_level
                when four_level != '' then four_level
                when five_level != '' then five_level
           end org_id
         , public_client_cmt, client_cmt, user_cmt, order_cmt, finish_order_cmt
    from ac_re
),
ac_dd as (
    select org_id
         , public_client_cmt, client_cmt, user_cmt, order_cmt, finish_order_cmt
         , ifnull(client_cmt / public_client_cmt, 0.00) client_cmt_rate
         , ifnull(user_cmt / client_cmt, 0.00) user_cmt_rate
         , ifnull(order_cmt / user_cmt, 0.00) order_cmt_rate
         , ifnull(finish_order_cmt / order_cmt, 0.00) finish_order_cmt_rate
    from ac_bb
    where org_id != 'null'
),
ac_cc as (
    select b.org_id
         , o.org_name
         , o.tenant_id
         , public_client_cmt, client_cmt, user_cmt, order_cmt, finish_order_cmt
         , client_cmt_rate, user_cmt_rate, order_cmt_rate, finish_order_cmt_rate
         , substr(current_timestamp(), 1, 19)
         , 'y'
         , substring('2024-06-19', 1, 4)
    from ac_dd b
    join db_elderly_care_ods.ods_tb_org_dap o on b.org_id = o.org_id
)
select * from ac_cc
```

Task JDBC exception: java.lang.RuntimeException: Internal error. Detail: deserialize chunk data failed. column slot id: 151, column row count: 8, expected row count: 244. There is probably a bug here.

yeyudefeng commented 6 days ago

I also tried shared-nothing (storage-compute coupled) mode on Kubernetes and reproduced the issue there, but shared-nothing mode on physical machines does not have the problem. Could it be related to FQDN? When I substantially reduce the number of columns and the row count in this SQL, the problem sometimes goes away.

yeyudefeng commented 4 days ago

I ran an experiment: a single FE + a single CN on a physical machine, with storage on Tencent COS. With FQDN disabled, the exception does not occur; with FQDN enabled, it does.