Open ankitsultana opened 1 year ago
@jadami10 @priyen This is similar to the issue you ran into
@Jackie-Jiang yes, this and force commits should manifest similarly
Just for clarity, our particular issue was due to the IS update lock
PR #11679 removes the FSM lock. Did you find that to be the bottleneck for some use case?
The FSM lock isn't technically global: it uses striping (num-locks=20) with segmentName as the hash key. The ops done under the lock should also be relatively quick, so I'm wondering whether there will be much improvement (since the IS update will continue to be the bottleneck).
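For anyone following along, here is a minimal sketch of what lock striping keyed on segment name looks like. It is not the actual Pinot code: the class name `StripedSegmentLocks`, its methods, and the example segment names are all made up for illustration, and only the stripe count (20) and the hash key (segmentName) come from the comment above. Because the table is not part of the stripe selection, two segments from different tables can end up on the same lock.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of striped locking; not Pinot's implementation.
class StripedSegmentLocks {
    private final Lock[] locks;

    StripedSegmentLocks(int numLocks) {
        locks = new Lock[numLocks];
        for (int i = 0; i < numLocks; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    // The stripe depends only on the segment name's hash, so the owning
    // table plays no role in which lock is chosen.
    int stripeFor(String segmentName) {
        return Math.floorMod(segmentName.hashCode(), locks.length);
    }

    Lock lockFor(String segmentName) {
        return locks[stripeFor(segmentName)];
    }

    // True when two segments would contend for the same lock.
    boolean sharesStripe(String segmentA, String segmentB) {
        return stripeFor(segmentA) == stripeFor(segmentB);
    }
}
```

With 20 stripes, any set of more than 20 in-flight segments must have at least two sharing a lock, regardless of which tables they belong to.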
It's been a while since I looked into this so I might need to take a deeper look if more context is needed.
@ankitsultana I'll use this ticket to track a series of improvements. The thread dump you took above is actually blocking on creating the FSM, which is solved with #11679. I referred to it as global because the locks are actually cross-table, and segments from different tables can block each other.
We are seeing an issue with one of our high ingestion throughput tables where ingestion lag continues to increase because of a table-level lock in the Pinot controller.
https://github.com/apache/pinot/blob/723b764bc91275c0b8361d3f9135f151b6404c39/pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/realtime/PinotLLCRealtimeSegmentManager.java#L594
@Jackie-Jiang I think we briefly discussed this yesterday. I don't have enough context about this, but one potential solution could be to add some jitter in the number of docs in a segment at partition level so these events arrive at slightly different times. But I don't think that is tenable and we may need a more proper fix.
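To make the jitter idea concrete, here is a rough sketch of a per-partition jittered row threshold. Everything in it is hypothetical (`FlushThresholdJitter`, `jitteredThreshold`, and the parameters are not Pinot APIs or configs); it only shows how partitions with the same base threshold could reach their commit point at slightly different times.

```java
import java.util.SplittableRandom;

// Hypothetical illustration of per-partition jitter; not a Pinot API.
class FlushThresholdJitter {
    // Returns baseRows adjusted by up to +/- maxJitterFraction, derived
    // deterministically from the partition id so that the same partition
    // always gets the same threshold.
    static int jitteredThreshold(int baseRows, int partitionId, double maxJitterFraction) {
        // SplittableRandom mixes the seed before producing output, so
        // consecutive partition ids do not yield near-identical offsets.
        SplittableRandom rng = new SplittableRandom(partitionId);
        double fraction = (rng.nextDouble() * 2.0 - 1.0) * maxJitterFraction;
        return (int) Math.round(baseRows * (1.0 + fraction));
    }
}
```

For example, `jitteredThreshold(1_000_000, partitionId, 0.1)` yields a value between 900,000 and 1,100,000 rows for each partition. This only spreads the commit events out; it does not remove the lock contention itself.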
Also, there is a lock being taken in logback which is causing a lot of contention (there were 10 threads blocked in a sample thread dump I took). That seems like a simpler fix which we may wanna do anyways.