ByConity / ByConity

ByConity is an open source cloud data warehouse
https://byconity.github.io/
Apache License 2.0

Worker node error #1703

Open · OnePainter opened this issue 2 months ago

OnePainter commented 2 months ago

I deployed ByConity on three physical machines following the package deployment guide on the official website (https://byconity.github.io/zh-cn/docs/deployment/package-deployment). byconity-tso.service, byconity-server.service, byconity-resource-manager.service, and byconity-daemon-manager.service run on the same node; byconity-worker.service runs on a second node and byconity-worker-write.service on a third. However, both worker nodes raise the following exception:

2024.06.13 10:22:53.981851 [ 24632 ] {} ResourceReporterTask: void DB::ResourceManagement::ResourceReporterTask::run(): Code: 7114, e.displayText() = DB::Exception: The leader from election result not work well SQLSTATE: HY000, Stack trace (when copying this message, always include the lines below):

  1. Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, int) @ 0x27227652 in /usr/bin/clickhouse
  2. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, int, bool) @ 0x10780640 in /usr/bin/clickhouse
  3. DB::ResourceManagement::ResourceManagerClient::registerWorker(DB::ResourceManagement::WorkerNodeResourceData const&) @ 0x2029f9cf in /usr/bin/clickhouse
  4. DB::ResourceManagement::ResourceReporterTask::sendRegister() @ 0x202c3e0a in /usr/bin/clickhouse
  5. DB::ResourceManagement::ResourceReporterTask::run() @ 0x202c3958 in /usr/bin/clickhouse
  6. DB::BackgroundSchedulePoolTaskInfo::execute() @ 0x2054877e in /usr/bin/clickhouse
  7. DB::BackgroundSchedulePool::threadFunction() @ 0x2054ab27 in /usr/bin/clickhouse
  8. void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPool::ThreadFromGlobalPool<DB::BackgroundSchedulePool::BackgroundSchedulePool(unsigned long, unsigned long, char const*, std::__1::shared_ptr)::$_1>(DB::BackgroundSchedulePool::BackgroundSchedulePool(unsigned long, unsigned long, char const*, std::__1::shared_ptr)::$_1&&)::'lambda'(), void ()> >(std::__1::__function::__policy_storage const*) @ 0x2054b337 in /usr/bin/clickhouse
  9. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x107bcb80 in /usr/bin/clickhouse
  10. void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl(std::__1::function<void ()>, int, std::__1::optional)::'lambda0'()> >(void*) @ 0x107c0ffa in /usr/bin/clickhouse
  11. start_thread @ 0x7fa3 in /usr/lib/x86_64-linux-gnu/libpthread-2.28.so
  12. __clone @ 0xf906f in /usr/lib/x86_64-linux-gnu/libc-2.28.so
OnePainter commented 2 months ago

byconity version 0.4.1

frankye1982 commented 2 months ago

For analysis and troubleshooting, need the following information:

  1. Logs from the resource manager at the relevant time points.
  2. Configuration file contents of cnch_config.xml, byconity-resource-manager.xml, and byconity-worker.xml, as this is deployed on physical machines.
OnePainter commented 2 months ago

A new problem when I insert data:

Received exception from server (version 21.8.7):
Code: 1000. DB::Exception: Received from localhost:9010. DB::Exception: Access to file denied: Permission denied: user=clickhouse, access=WRITE, inode="/":root:supergroup:drwxr-xr-x
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:506)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:346)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermissionWithContext(FSPermissionChecker.java:370)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:240)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1943)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1927)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1886)
        at org.apache.hadoop.hdfs.server.namenode.FSDirMkdirOp.mkdirs(FSDirMkdirOp.java:60)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3438)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:1166)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:742)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
SQLSTATE: HY000.

kevinthfang commented 2 months ago

Looks like an issue with HDFS write access. Did you install HDFS yourself? Did you run the script in the installation guide to create the users?
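
The `access=WRITE, inode="/":root:supergroup` part of the error means the NameNode rejected a mkdir by the `clickhouse` OS user under a root-owned directory. A minimal sketch of granting access, assuming the service runs as `clickhouse` and ByConity's HDFS storage path is `/user/clickhouse` (both are assumptions; substitute the user and the path configured in your cnch_config.xml):

```shell
# Run as the HDFS superuser (often 'hdfs').
# Create the ByConity storage directory and hand it to the service user.
hdfs dfs -mkdir -p /user/clickhouse
hdfs dfs -chown -R clickhouse /user/clickhouse
# Verify the new owner.
hdfs dfs -ls /
```

After this, the `Permission denied` on insert should disappear as long as every write path ByConity uses lives under the chowned directory.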

OnePainter commented 2 months ago

There is no handle /metrics

Use / or /ping for health checks. Or /replicas_status for more sophisticated health checks.

Send queries from your program with POST method or GET /?query=...

Use clickhouse-client:

For interactive data analysis: clickhouse-client

For batch query processing:
clickhouse-client --query='SELECT 1' > result
clickhouse-client < query > result
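
The banner above is the ClickHouse-compatible HTTP interface replying that it has no /metrics handler on that port (so a Prometheus scrape of it will fail). A quick way to exercise the endpoints it does serve; the 8123 port is an assumption, use the http_port from your server config:

```shell
# Health check; the server typically answers "Ok."
curl 'http://localhost:8123/ping'

# Run a query over HTTP: POST the query body, or use GET /?query=...
echo 'SELECT 1' | curl 'http://localhost:8123/' --data-binary @-
curl 'http://localhost:8123/?query=SELECT%201'
```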

OnePainter commented 2 months ago

Code: 115, e.displayText() = DB::Exception: Setting match[] is neither a builtin setting nor started with the prefix 'SQL_' registered for user-defined settings SQLSTATE: 42000 (version 21.8.7.1)

OnePainter commented 2 months ago

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "cnch-metrics-.yaml"
  - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label job=<job_name> to any timeseries scraped from this config.

OnePainter commented 2 months ago

2024.06.17 21:27:10.824006 [ 333 ] {} void DB::StorageElector::doFollowerCheck(): Code: 1031, e.displayText() = DB::Exception: FDB error : Operation aborted because the transaction timed out SQLSTATE: HY000, Stack trace (when copying this message, always include the lines below):

  1. Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, int) @ 0x27227652 in /usr/bin/clickhouse
  2. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, int, bool) @ 0x10780640 in /usr/bin/clickhouse
  3. DB::Catalog::MetastoreFDBImpl::check_fdb_op(int const&) @ 0x200bf96d in /usr/bin/clickhouse
  4. DB::Catalog::MetastoreFDBImpl::get(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator >&) @ 0x200bfcb5 in /usr/bin/clickhouse
  5. DB::StorageElector::doFollowerCheck() @ 0x22d7ca6c in /usr/bin/clickhouse
  6. void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPool::ThreadFromGlobalPool<DB::StorageElector::start()::$_0>(DB::StorageElector::start()::$_0&&)::'lambda'(), void ()> >(std::__1::__function::__policy_storage const*) @ 0x22d7e220 in /usr/bin/clickhouse
  7. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x107bcb80 in /usr/bin/clickhouse
  8. void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl(std::__1::function<void ()>, int, std::__1::optional)::'lambda0'()> >(void*) @ 0x107c0ffa in /usr/bin/clickhouse
  9. start_thread @ 0x7fa3 in /usr/lib/x86_64-linux-gnu/libpthread-2.28.so
  10. __clone @ 0xf906f in /usr/lib/x86_64-linux-gnu/libc-2.28.so (version 21.8.7.1)
OnePainter commented 2 months ago

org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not enough replicas was chosen. Reason: {NO_REQUIRED_STORAGE_TYPE=1}

OnePainter commented 2 months ago

Code: 159. DB::Exception: Received from localhost:9010. DB::Exception: Query 949e1ba9-2d15-4673-9587-e9c51b6f3574 receive data timeout, maybe you can increase settings max_execution_time. Debug info for source ExchangeSource: MultiPathReceiver[1_0_00]: Try pop receive collector for MultiPathReceiver[1_0_00] timeout at 2024-06-18 08:39:56: While executing ExchangeSource: MultiPathReceiver[1_0_0_0_10.64.1.188:8124] SQLSTATE: HY000.

Timeout exceeded while receiving data from server. Waited for 300 seconds, timeout is 300 seconds. Cancelling query.
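
Both deadlines in the messages above are per-query settings, as the server hint suggests. A minimal sketch of raising them for one query, assuming 1800 seconds suits the workload (the value and the table name are placeholders):

```shell
# --receive_timeout raises the client-side wait; the SETTINGS clause raises
# the server-side execution limit for this query only.
clickhouse-client --receive_timeout=1800 \
  --query="SELECT count() FROM my_table SETTINGS max_execution_time = 1800"
```

Note that raising timeouts only hides the symptom if a worker is genuinely unreachable, as in the ExchangeSource error above.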

OnePainter commented 2 months ago

fdb> status

WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file `fdb_runtime/config/fdb.cluster'.

Unable to start default priority transaction after 5 seconds.

Unable to start batch priority transaction after 5 seconds.

Unable to retrieve all status information.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 12
  Zones                  - 3
  Machines               - 3
  Memory availability    - 8.0 GB per process on machine with least available
  Fault Tolerance        - 1 machines
  Server time            - 06/20/24 16:39:17

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 6 MB
  Disk space used        - 641 MB

Operating space:
  Storage server         - 0.0 GB free on most full server
  Log server             - 0.0 GB free on most full server

Workload:
  Read rate              - 4 Hz
  Write rate             - 0 Hz
  Transactions started   - 2 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz
  Performance limited by process: Log server running out of space (approaching 100MB limit).
  Most limiting process: xxxxxxxx:4501

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 06/20/24 16:39:09

OnePainter commented 2 months ago

2024.06.21 13:11:14.537606 [ 32111 ] {} CnchWorkerService: auto DB::CnchWorkerServiceImpl::sendResources(google::protobuf::RpcController*, const Protos::SendResourcesReq*, Protos::SendResourcesResp*, google::protobuf::Closure*)::(anonymous class)::operator()() const: Code: 57, e.displayText() = DB::Exception: Table TESTDB.TEST already exists. SQLSTATE: 42P07, Stack trace (when copying this message, always include the lines below):

  1. Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, int) @ 0x26f45d32 in /usr/bin/clickhouse
  2. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, int, bool) @ 0x105d16e0 in /usr/bin/clickhouse
  3. DB::CnchWorkerResource::executeCreateQuery(std::__1::shared_ptr, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&, bool, DB::ColumnsDescription const&) @ 0x1ff33093 in /usr/bin/clickhouse
  4. DB::CnchWorkerServiceImpl::sendResources(google::protobuf::RpcController*, DB::Protos::SendResourcesReq const*, DB::Protos::SendResourcesResp*, google::protobuf::Closure*)::$_13::operator()() const @ 0x1ff20c80 in /usr/bin/clickhouse
  5. ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) @ 0x106114f1 in /usr/bin/clickhouse
OnePainter commented 2 months ago

SELECT * FROM TEST LIMIT 100 SETTINGS enable_optimizer_fallback = 0

Query id: f811e6d1-2cec-45ec-a610-e9b303257361

0 rows in set. Elapsed: 0.095 sec.

Received exception from server (version 21.8.7):
Code: 2007. DB::Exception: Received from localhost:9010. DB::Exception: send plan segment async failed
  error code : 2001
  error worker : 10.64.1.101:8124
  error text : [E2001][10.64.1.101:123456789] Code: 516, e.displayText() = DB::Exception: dbadmin: Authentication failed: password is incorrect or there is no user with such name SQLSTATE: HY000,