ByConity / ByConity

ByConity is an open source cloud data warehouse
https://byconity.github.io/
Apache License 2.0

Manual Disk Cache Preload Fails with "No previous part" #1885

Open · ozansz opened this issue 1 month ago

ozansz commented 1 month ago

Bug Report

Briefly describe the bug

When I run ALTER DISK CACHE PRELOAD on a single partition, it fails with the error: DB::Exception: DB::Exception: No previous part of <part>.

The result you expected

I expected the preload to finish without any exception.

How to Reproduce

I ran the query below to preload the partition named '2024-09-12', with send_logs_level set to trace:

ALTER DISK CACHE PRELOAD TABLE my_database.my_table SYNC PARTITION '2024-09-12' SETTINGS parts_preload_level = 3, virtual_warehouse = 'my_vw', preload_send_rpc_max_ms = 600000

which printed the below logs and failed:

[byconity-server-0] 2024.10.04 09:22:20.567786 [ 256931 ] {6e695432-277e-4a49-be9b-63886d066ae0} <Trace> ServerResource: SourcePrune Send my_db.my_table dead1cef-bc59-4efd-8da6-16bba27b8a2a's data part to worker {byconity-vw-vw-default-0 10.72.249.126  rpc/8124 tcp/9000 exc/8124 exs/8124}
[byconity-server-0] 2024.10.04 09:22:20.567806 [ 256931 ] {6e695432-277e-4a49-be9b-63886d066ae0} <Trace> ServerResource: Storage my_db.`my_table` (dead1cef-bc59-4efd-8da6-16bba27b8a2a) host {byconity-vw-vw-default-0 10.72.249.126  rpc/8124 tcp/9000 exc/8124 exs/8124} prune_table false has_parts true
[byconity-server-0] 2024.10.04 09:22:20.567836 [ 256931 ] {6e695432-277e-4a49-be9b-63886d066ae0} <Trace> ServerResource: SourcePrune Send my_db.my_table dead1cef-bc59-4efd-8da6-16bba27b8a2a's data part to worker {byconity-vw-vw-default-1 10.72.150.38  rpc/8124 tcp/9000 exc/8124 exs/8124}
[byconity-server-0] 2024.10.04 09:22:20.567849 [ 256931 ] {6e695432-277e-4a49-be9b-63886d066ae0} <Trace> ServerResource: Storage my_db.`my_table` (dead1cef-bc59-4efd-8da6-16bba27b8a2a) host {byconity-vw-vw-default-1 10.72.150.38  rpc/8124 tcp/9000 exc/8124 exs/8124} prune_table false has_parts true
[byconity-server-0] 2024.10.04 09:22:20.568070 [ 256931 ] {6e695432-277e-4a49-be9b-63886d066ae0} <Trace> my_db.`my_table` (dead1cef-bc59-4efd-8da6-16bba27b8a2a) (CnchMergeTree): send preload data parts size = 40, enable_parts_sync_preload = true, parts_preload_level = 3, submit_ts = 1728033740, time_ms = 3
[byconity-server-0] 2024.10.04 09:22:20.568215 [ 256931 ] {6e695432-277e-4a49-be9b-63886d066ae0} <Trace> my_db.`my_table` (dead1cef-bc59-4efd-8da6-16bba27b8a2a) (CnchMergeTree): send preload data parts size = 39, enable_parts_sync_preload = true, parts_preload_level = 3, submit_ts = 1728033740, time_ms = 3
[byconity-server-0] 2024.10.04 09:23:48.208895 [ 256931 ] {6e695432-277e-4a49-be9b-63886d066ae0} <Debug> TransactionCoordinator: Deleted txn 452993676780765285
[byconity-server-0] 2024.10.04 09:23:48.208940 [ 256931 ] {6e695432-277e-4a49-be9b-63886d066ae0} <Trace> TransactionCoordinator: No need to erase active timestamp for txn 452993676780765285
[byconity-server-0] 2024.10.04 09:23:48.209009 [ 256931 ] {6e695432-277e-4a49-be9b-63886d066ae0} <Error> executeQuery: Code: 49, e.displayText() = DB::Exception: DB::Exception: No previous part of 20240912_452493706844177029_452494136926011393_6_452615359823348074 SQLSTATE: HY000. SQLSTATE: HY000 (version 21.8.7.1) (from [::1]:60048) (in query: ALTER DISK CACHE PRELOAD TABLE my_db.`my_table` PARTITION '2024-09-12' SYNC SETTINGS parts_preload_level = 3, virtual_warehouse = 'my_vw', preload_send_rpc_max_ms = 600000), Stack trace (when copying this message, always include the lines below):

0. Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x275e5e12 in /opt/byconity/bin/clickhouse
1. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x10c2afc0 in /opt/byconity/bin/clickhouse
2. DB::readException(DB::ReadBuffer&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool) @ 0x10c9e851 in /opt/byconity/bin/clickhouse
3. DB::RPCHelpers::checkException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 0x208d8dcb in /opt/byconity/bin/clickhouse
4. void DB::RPCHelpers::onAsyncCallDone<DB::Protos::PreloadDataPartsResp>(DB::Protos::PreloadDataPartsResp*, brpc::Controller*, std::__1::shared_ptr<DB::ExceptionHandler>) @ 0x206ae0e9 in /opt/byconity/bin/clickhouse
5. brpc::internal::FunctionClosure3<DB::Protos::PreloadDataPartsResp*, brpc::Controller*, std::__1::shared_ptr<DB::ExceptionHandler> >::Run() @ 0x206b01ef in /opt/byconity/bin/clickhouse
6. /home/ByConity/build_docker/../contrib/incubator-brpc/src/brpc/controller.h:716: brpc::Controller::EndRPC(brpc::Controller::CompletionInfo const&) @ 0x27341fa9 in /opt/byconity/bin/clickhouse
7. /home/ByConity/build_docker/../contrib/incubator-brpc/src/brpc/controller.cpp:0: brpc::Controller::OnVersionedRPCReturned(brpc::Controller::CompletionInfo const&, bool, int) @ 0x27340b16 in /opt/byconity/bin/clickhouse
8. /home/ByConity/build_docker/../contrib/incubator-brpc/src/brpc/details/controller_private_accessor.h:0: brpc::policy::ProcessRpcResponse(brpc::InputMessageBase*) @ 0x2736a070 in /opt/byconity/bin/clickhouse
9. /home/ByConity/build_docker/../contrib/incubator-brpc/src/brpc/input_messenger.cpp:386: brpc::InputMessenger::OnNewMessages(brpc::Socket*) @ 0x273608ce in /opt/byconity/bin/clickhouse
10. /home/ByConity/build_docker/../contrib/libcxx/include/memory:1655: brpc::Socket::ProcessEvent(void*) @ 0x27490d0d in /opt/byconity/bin/clickhouse
11. /home/ByConity/build_docker/../contrib/incubator-brpc/src/bthread/task_group.cpp:304: bthread::TaskGroup::task_runner(long) @ 0x2732e10d in /opt/byconity/bin/clickhouse

0 rows in set. Elapsed: 88.027 sec.

Received exception from server (version 21.8.7):
Code: 49. DB::Exception: Received from localhost:9000. DB::Exception: DB::Exception: No previous part of 20240912_452493706844177029_452494136926011393_6_452615359823348074 SQLSTATE: HY000. SQLSTATE: HY000.
I also have a few questions:

  1. How does the preload process work? Does the server download the part data from S3 and send it to the read workers, or vice versa?
  2. How can I debug this "part has no previous part" issue in the server/workers/FDB?
  3. Where could this issue originate from? Could the part metadata in FDB be corrupted, or the part data itself?

Thanks in advance.

foreverneverer commented 1 month ago
  1. Running an ALTER DISK CACHE query triggers the server to fetch the necessary part metadata from FDB and send it to the table's vw_read workers, which then load the parts onto their local disks.
  2. I can't find that log line in the server-side code, so it probably comes from the vw_read workers. You would need to set a breakpoint at the error line on both the server and vw_read to get the detailed stack.
  3. Did you modify the column data? This is almost certainly a bug rather than data corruption.
smmsmm1988 commented 1 month ago

@ozansz For question 2, you can: 1) connect to FDB and list the metadata under the prefix PT_{table_uuid}_20240912_452493706844177029452494136926011393; 2) grep for the prefix 20240912_452493706844177029452494136926011393 in the server logs to see the part's change history.
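A minimal sketch of those two steps as shell commands. The part name and table UUID are taken from the error and logs in this thread; the `fdbcli` invocation, cluster-file path, and log path are assumptions about a typical deployment:

```shell
#!/bin/sh
# Part-name prefix and table UUID copied from the "No previous part" error above.
PART_PREFIX="20240912_452493706844177029_452494136926011393"
TABLE_UUID="dead1cef-bc59-4efd-8da6-16bba27b8a2a"

# Step 1 (assumed invocation): list the part metadata stored in FoundationDB
# under the PT_{table_uuid}_ prefix. fdbcli's getrange takes a begin key, an
# end key, and an optional limit; appending \xff to the prefix makes the range
# cover every key that starts with it.
#   fdbcli -C /etc/foundationdb/fdb.cluster \
#     --exec "getrange PT_${TABLE_UUID}_${PART_PREFIX} PT_${TABLE_UUID}_${PART_PREFIX}\xff 100"

# Step 2: grep the server logs for the part prefix to trace its change history.
# Simulated here with a single log line from the thread; on a real server you
# would grep the actual log directory (the path is an assumption).
cat > /tmp/byconity-server.log <<'EOF'
2024.10.04 09:23:48 <Error> executeQuery: No previous part of 20240912_452493706844177029_452494136926011393_6_452615359823348074
EOF
grep -c "$PART_PREFIX" /tmp/byconity-server.log   # prints the matching line count: 1
```

Each hit from step 2 shows one event in the part's lifecycle (insert, merge, drop), which should help narrow down when the previous-part link was lost.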