Abnormal CPU usage of BE node

Ruees commented 1 year ago

Search before asking

[X] I had searched in the issues and found no similar issues.

Version

1.2.6

What's Wrong?

There are three BE nodes in a resource group, and Flink CDC tasks have been synchronizing a MySQL database to Doris. CPU usage on two of the BE nodes is around 20%, but on the third it is as high as 90%. It should not be a compaction, because this state has lasted for a day. I took a stack trace on this BE node and obtained the following information:

2023-11-01 09:52:55 Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.351-b10 mixed mode):

"Attach Listener" #12 daemon prio=9 os_prio=0 tid=0x00007effc2c79800 nid=0x1fb212 waiting on condition [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

"Service Thread" #8 daemon prio=9 os_prio=0 tid=0x00007f00993a8800 nid=0x1fac8b runnable [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

"C1 CompilerThread2" #7 daemon prio=9 os_prio=0 tid=0x00007f0099129000 nid=0x1fac8a waiting on condition [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007f0099128000 nid=0x1fac89 waiting on condition [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f00efe1f000 nid=0x1fac88 waiting on condition [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007f00993a8000 nid=0x1fac87 runnable [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f00993a7000 nid=0x1fac86 in Object.wait() [0x00007f0095875000]
    java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x000000056ab08f08> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:150)
    - locked <0x000000056ab08f08> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:171)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:188)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f00eedd8800 nid=0x1fac85 in Object.wait() [0x00007f0095976000]
    java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x000000056ab06ba0> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:502)
    at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
    - locked <0x000000056ab06ba0> (a java.lang.ref.Reference$Lock)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

"main" #1 prio=5 os_prio=0 tid=0x00007f00eae8e000 nid=0x1fac79 runnable [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

"VM Thread" os_prio=0 tid=0x00007f00ed8e2800 nid=0x1fac84 runnable

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f00ed8df000 nid=0x1fac7e runnable

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f00ed8e0000 nid=0x1fac7f runnable

"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f00ed8e0800 nid=0x1fac80 runnable

"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f00ed8e1000 nid=0x1fac81 runnable

"VM Periodic Task Thread" os_prio=0 tid=0x00007f00ed8e3000 nid=0x1fac8c waiting on condition

JNI global references: 351

What You Expected?

Identify the cause of the high CPU usage on this BE node and how to resolve it.

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

[ ] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Ruees commented 1 year ago

Some abnormal information was found in be.INFO:

1101 03:12:21.129379 2076520 task_worker_pool.cpp:725] failed to publish version|signature=5787540|transaction_id=5787540|error_tablets_num=100|error=[E-3115]
I1101 03:12:21.129380 2076527 task_worker_pool.cpp:693] task elapsed 11 seconds since it is inserted to queue, it is timeout
W1101 03:12:21.129392 2076527 task_worker_pool.cpp:725] failed to publish version|signature=5787551|transaction_id=5787551|error_tablets_num=100|error=[E-3115]
I1101 03:12:21.129451 2076521 task_worker_pool.cpp:693] task elapsed 11 seconds since it is inserted to queue, it is timeout
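To see how widespread these failures are, the log can be summarized with a short script. The following is only a sketch in Python: the function names are hypothetical, and the pattern is inferred from the lines quoted above rather than from any documented Doris log format.

```python
import re
from collections import Counter

# Pattern inferred from the "failed to publish version" lines quoted above.
# It is an assumption about the log layout, not a documented format.
PUBLISH_FAIL = re.compile(
    r"failed to publish version"
    r"\|signature=(?P<signature>\d+)"
    r"\|transaction_id=(?P<txn>\d+)"
    r"\|error_tablets_num=(?P<tablets>\d+)"
    r"\|error=\[(?P<error>E-?\d+)\]"
)


def find_publish_failures(log_text):
    """Return one dict per 'failed to publish version' entry found in log_text."""
    return [
        {
            "transaction_id": int(m.group("txn")),
            "error_tablets_num": int(m.group("tablets")),
            "error": m.group("error"),
        }
        for m in PUBLISH_FAIL.finditer(log_text)
    ]


def count_by_error(failures):
    """Count failures per error code, e.g. to see whether one code dominates."""
    return Counter(f["error"] for f in failures)
```

Running `find_publish_failures(open("be.INFO").read())` on the affected node and on a healthy node would show whether the publish failures are specific to the node with high CPU usage.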

LemonLiTree commented 1 year ago

Is it importing into a MoW (merge-on-write) table?

tapomoyadhikari commented 1 year ago

The thread dump you provided doesn't contain detailed information about the specific Flink CDC task or your application's code. However, I can offer some general guidance on how to approach high CPU usage on one of the BE nodes receiving data from Flink CDC tasks:

Analyze the High CPU Thread:
- You'll need to identify which thread within the BE process is causing the high CPU usage. This requires more detailed information about the threads and their activity than the dump above provides.
- Use a tool like jstack, jvisualvm, or another profiler to capture thread dumps and see what the high-CPU thread is doing. Note that jstack only covers JVM threads, so a native profiler may be needed for the rest of the process. This will help you pinpoint the exact issue.
Possible Causes of High CPU Usage:
- Inefficient code: Review the code of your Flink CDC task to ensure it's optimized and not causing unnecessary CPU load.
- Data volume: High data volumes being processed by the task can lead to high CPU usage.
- Resource contention: Check for resource contention issues, such as lock contention that keeps threads spinning and consuming CPU.
Check Flink Configuration:
- Review the Flink configuration parameters, such as parallelism, to ensure they are set appropriately for your task.
MySQL and Doris Synchronization:
- The high CPU usage may be related to the MySQL and Doris synchronization process. Ensure that the synchronization process is configured correctly and efficiently.
Monitoring:
- Set up monitoring tools like Prometheus, Grafana, or other monitoring solutions to gain insights into the performance of your Flink CDC tasks.
Scale Out:
- If the high CPU usage is due to high data volumes, consider scaling out so that the load is distributed across more BE nodes.
Optimization:
- Profile and optimize your code, identify bottlenecks, and make necessary improvements.
Fine-Tuning:
- Fine-tune Flink's configuration settings based on your specific workload and requirements.
Updates and Patches:
- Ensure that you are using the latest versions of Flink and other components, and apply any relevant updates or patches.
Consult Documentation and Community:
- Refer to the documentation for Flink and your synchronization tools for best practices and troubleshooting guidance.
- Seek help from the Flink and Doris communities or support channels for more specific assistance.

Without more detailed information, it's challenging to pinpoint the exact cause of the high CPU usage. You may need to investigate the application further and monitor its behavior to identify and resolve the issue. Additionally, consider involving your development and operations teams to collaborate on debugging and optimizing the system.
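As a concrete version of the "Analyze the High CPU Thread" step, one common approach is to take the busiest thread IDs from `top -H -p <pid>` and look them up in a HotSpot thread dump, where `nid` is the native thread ID in hexadecimal. The sketch below assumes the dump format shown in this issue; the function names are hypothetical.

```python
import re

# Thread header lines in a HotSpot dump start with the quoted thread name
# and carry the native thread id as nid=0x... (hexadecimal).
THREAD_HEADER = re.compile(
    r'^"(?P<name>[^"\n]+)".*\bnid=(?P<nid>0x[0-9a-fA-F]+)',
    re.MULTILINE,
)


def threads_by_tid(dump_text):
    """Map decimal native thread id -> JVM thread name."""
    return {
        int(m.group("nid"), 16): m.group("name")
        for m in THREAD_HEADER.finditer(dump_text)
    }


def name_hot_threads(dump_text, hot_tids):
    """Label each decimal TID (as printed by top -H) with its JVM thread name.

    TIDs that do not appear in the dump are native threads outside the JVM,
    so they need a native tool rather than jstack.
    """
    known = threads_by_tid(dump_text)
    return {tid: known.get(tid, "<not a JVM thread>") for tid in hot_tids}
```

If the hot TIDs all come back as `<not a JVM thread>`, the CPU is being spent outside the JVM threads listed in the dump, and a JVM thread dump alone will not explain it.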
