chenboat opened this issue 3 years ago
@jackjlli and @mcvsubbu I wonder if you have encountered the above deadlock issue in your Prod clusters.
The default pool size for HelixTaskExecutor shouldn't be just 1. I checked the Helix code and the default value is 40:
https://github.com/apache/helix/blob/78f6c8beaf05302703809dd7de231605ff4d74a3/helix-core/src/main/java/org/apache/helix/messaging/handling/HelixTaskExecutor.java#L115
Please adjust `STATE_TRANSITION.maxThreads` in the `CONFIGS/CLUSTER/pinot` ZNode and restart all the Pinot components sequentially (controller first, then broker, server, etc.) to correctly pick up the new setting.
Pinot 1.0 used to have logic that set `STATE_TRANSITION.maxThreads` to 1 if this config was missing in the cluster. This causes a problem because that default of 1 is then used as the thread pool size, which makes the Pinot cluster extremely slow. If you upgrade a Pinot cluster from 1.0 to a higher version, please make sure `STATE_TRANSITION.maxThreads` is set to a number higher than 1 in the `CONFIGS/CLUSTER/pinot` ZNode before rolling out the newer version.
Alternatively, just remove `STATE_TRANSITION.maxThreads` entirely, and the cluster will fall back to the Helix default value of 40.
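For reference, both the update and the removal can be done programmatically with Helix's `ConfigAccessor`. This is only a minimal sketch: the ZooKeeper address `localhost:2181` and the cluster name `pinot` are placeholders for your deployment, and you can equally well edit the ZNode with any ZooKeeper client.

```java
import org.apache.helix.ConfigAccessor;
import org.apache.helix.model.HelixConfigScope;
import org.apache.helix.model.HelixConfigScope.ConfigScopeProperty;
import org.apache.helix.model.builder.HelixConfigScopeBuilder;

public class AdjustStateTransitionMaxThreads {
  public static void main(String[] args) {
    // Placeholder ZK address and cluster name -- substitute the values for your deployment.
    ConfigAccessor configAccessor = new ConfigAccessor("localhost:2181");
    HelixConfigScope clusterScope =
        new HelixConfigScopeBuilder(ConfigScopeProperty.CLUSTER).forCluster("pinot").build();

    // Option 1: raise the cluster-level override (written into the CONFIGS/CLUSTER/pinot ZNode).
    configAccessor.set(clusterScope, "STATE_TRANSITION.maxThreads", "40");

    // Option 2: drop the override so Helix falls back to its built-in default of 40.
    // configAccessor.remove(clusterScope, "STATE_TRANSITION.maxThreads");
  }
}
```

Either way, running instances still need the sequential restart described above to pick up the change.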
Thanks. The issue has since been resolved. But it would be great to explicitly expose this config in the Pinot installation. It is also worthwhile to fix the deadlock when maxThreads == 1.
It is useful to add some notes here on this.
The semaphore was put in place because Kafka consumers had an issue with dynamically creating new Kafka metrics when a new consumer was created on the same partition while the old consumer was kept open. This led to metrics proliferation.
From Kafka's (or any other stream's) point of view, if they emit per-partition consumer metrics, it probably makes sense to add distinct metrics for each consumer so as to report the health of the system correctly. So we need to handle this somehow.
One way could be to implement an internal "queue" of state transitions, holding the CONSUMING transition until the ONLINE transition for the previous segment is received. Of course, we would have to respond OK for the CONSUMING state transition, but just not let our software handle it yet. This is a little tricky to implement and sounds somewhat hacky to me. It also has the downside that we would be responding OK without actually handling the state transition. Helix does not provide a way for the participant to mark a transition as ERROR; we could ask for that feature (useful for other things as well), and then we could mark the consumer in the ERROR state if it later fails to transition correctly.
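To make the idea concrete, here is a rough, hypothetical sketch (not existing Pinot code; all names are illustrative) of holding a partition's CONSUMING transition until the previous segment's ONLINE transition has been processed:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-partition queue: the OFFLINE->CONSUMING transition is acknowledged to
// Helix right away, but the actual consumer start is deferred while the previous segment
// of the same partition still has a CONSUMING->ONLINE transition outstanding.
public class DeferredConsumingTransitions {
  private final Map<Integer, Deque<Runnable>> _pendingStarts = new HashMap<>();
  private final Map<Integer, Boolean> _partitionBusy = new HashMap<>();

  // Called when OFFLINE->CONSUMING arrives; returns immediately so Helix sees success.
  public synchronized void onConsumingTransition(int partitionId, Runnable startConsumer) {
    if (Boolean.TRUE.equals(_partitionBusy.get(partitionId))) {
      // Previous segment not ONLINE yet: park the start until onSegmentOnline() fires.
      _pendingStarts.computeIfAbsent(partitionId, k -> new ArrayDeque<>()).add(startConsumer);
    } else {
      _partitionBusy.put(partitionId, true);
      startConsumer.run(); // a real implementation would dispatch this asynchronously
    }
  }

  // Called after CONSUMING->ONLINE completes for the previous segment of this partition.
  public synchronized void onSegmentOnline(int partitionId) {
    Deque<Runnable> pending = _pendingStarts.get(partitionId);
    if (pending != null && !pending.isEmpty()) {
      pending.poll().run(); // start the next segment's consumer now
    } else {
      _partitionBusy.put(partitionId, false);
    }
  }
}
```

Note that this keeps the downside mentioned above: Helix has already been told OK, so a later failure in the deferred start cannot be reported back as an ERROR transition.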
A simple workaround in the short/medium term may be to stop realtime table creation in the controller if max-threads is not set to at least 2 (we don't expect more than one outstanding state transition for the same partition).
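A minimal sketch of what that controller-side check might look like, assuming the value is read from the cluster scope with Helix's `ConfigAccessor` (the guard itself is hypothetical, not existing Pinot code):

```java
import org.apache.helix.ConfigAccessor;
import org.apache.helix.model.HelixConfigScope;
import org.apache.helix.model.HelixConfigScope.ConfigScopeProperty;
import org.apache.helix.model.builder.HelixConfigScopeBuilder;

// Hypothetical guard invoked from the realtime table creation path.
public class RealtimeTableCreationGuard {
  public static void checkStateTransitionMaxThreads(ConfigAccessor configAccessor, String clusterName) {
    HelixConfigScope scope =
        new HelixConfigScopeBuilder(ConfigScopeProperty.CLUSTER).forCluster(clusterName).build();
    String maxThreads = configAccessor.get(scope, "STATE_TRANSITION.maxThreads");
    // Unset means the Helix default of 40 applies, which is fine; an explicit value must be >= 2.
    if (maxThreads != null && Integer.parseInt(maxThreads) < 2) {
      throw new IllegalStateException("STATE_TRANSITION.maxThreads=" + maxThreads
          + " can deadlock consuming segments; set it to at least 2 (or remove it) before creating realtime tables");
    }
  }
}
```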
I am open to hearing other ideas on how to fix the deadlock.
We found during our testing that the Pinot LLC realtime consumer stopped ingesting due to out-of-order delivery of Helix state transition events for consecutive segments of a partition. The detailed log is attached.
Note that OFFLINE->CONSUMING for `terra_lever_events__0__88__20200901T2252Z` was delivered slightly before CONSUMING->ONLINE for `terra_lever_events__0__87__20200901T1652Z`.
For `terra_lever_events__0__88__20200901T2252Z` to get to the CONSUMING state, the HelixTaskExecutor needs to acquire the semaphore for the partition. However, the semaphore is held by `terra_lever_events__0__87__20200901T1652Z` and cannot be released until the HelixTaskExecutor executes its CONSUMING->ONLINE transition (which is next in the task queue). As a result, a deadlock occurred and the LLC consumer stopped ingesting. In fact, the server stopped ingestion for all tables because the HelixTaskExecutor's java.util.concurrent.ThreadPoolExecutor had the default pool size = 1 and active threads = 1.
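For illustration only (this is not Pinot code), here is a tiny standalone program that reproduces the same shape of deadlock: a single-thread executor where the first queued task blocks on a semaphore that only the second queued task can release.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class SingleThreadSemaphoreDeadlock {
  public static void main(String[] args) throws InterruptedException {
    // Stands in for the per-partition semaphore, already held by segment 87's consumer.
    Semaphore partitionSemaphore = new Semaphore(1);
    partitionSemaphore.acquire();

    // Stands in for HelixTaskExecutor with STATE_TRANSITION.maxThreads = 1.
    ExecutorService helixTaskExecutor = Executors.newFixedThreadPool(1);

    // Task 1: OFFLINE->CONSUMING for segment 88 -- blocks waiting for the semaphore.
    helixTaskExecutor.submit(() -> {
      try {
        partitionSemaphore.acquire();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    // Task 2: CONSUMING->ONLINE for segment 87 -- would release the semaphore,
    // but it never runs because the only worker thread is stuck in task 1.
    helixTaskExecutor.submit(partitionSemaphore::release);

    System.out.println("Both transitions submitted; the process now hangs forever.");
  }
}
```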
@jackjlli @mcvsubbu