Open original-brownbear opened 2 years ago
Pinging @elastic/es-search (Team:Search)
Pinging @elastic/es-distributed (Team:Distributed)
@original-brownbear does this bug still exist? I know we have made progress on moving & detecting things that could block transport thread pools.
Pinging @elastic/es-search-foundations (Team:Search Foundations)
@benwtrent I fixed most but not all cases. There are still some corner cases left around waiting for refresh and one other thing, but let me try to fix this now; I might have a short fix.
I looked into transport worker slow logging in 7.17, and one outstanding and recurring issue is logs like the below:
I believe this is caused by the fact that the underlying action decrements the store ref count. If it turns out to be the last to decrement the ref count here, then the closing of the store (including acquiring the shard lock) runs on a transport thread.
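A minimal sketch of the suspected mechanism, assuming a plain atomic ref count. `SketchStore`, `LastDecrementDemo`, and the thread name `transport_worker[sketch]` are hypothetical names made up for illustration, not the actual Elasticsearch classes. It only shows that whichever thread performs the last decrement also runs the close:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a ref-counted store; NOT the real Elasticsearch Store class.
class SketchStore {
    // Starts at 1: the reference held by the owning shard.
    private final AtomicInteger refCount = new AtomicInteger(1);
    // Name of the thread that ended up running the close, for illustration only.
    volatile String closedOnThread;

    void incRef() {
        refCount.incrementAndGet();
    }

    void decRef() {
        // Whoever drops the count to zero runs the close inline, on its own thread.
        if (refCount.decrementAndGet() == 0) {
            closeInternal();
        }
    }

    private void closeInternal() {
        // A real close would do IO and take the shard lock here.
        closedOnThread = Thread.currentThread().getName();
    }
}

class LastDecrementDemo {
    static String run() {
        SketchStore store = new SketchStore();
        store.incRef(); // an in-flight search action takes a reference
        store.decRef(); // the shard releases its own reference (e.g. relocation)
        // The search action finishes last, on a thread named like a transport worker.
        Thread worker = new Thread(store::decRef, "transport_worker[sketch]");
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return store.closedOnThread;
    }

    public static void main(String[] args) {
        System.out.println("closed on: " + run());
    }
}
```

In this sketch the shard gives up its reference first, so the search action's release is the one that reaches zero, and the close is recorded against the worker thread rather than the thread that owns the shard.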
I think this can only happen (but it happens quite a bit in Cloud logs) if there's a concurrent relocation or so, but regardless, IO should never run on transport workers.
I wonder if we may have other spots where this occurs and the last decrement for the store hits via a search action on a transport thread. It might be worth adding an assertion for not running the store close on a transport worker when fixing this.
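The kind of assertion suggested above could look roughly like the following. This is a sketch under stated assumptions, not the actual fix: `ThreadChecks`, `GuardedStore`, and `GuardDemo` are hypothetical, and identifying transport workers by a `transport_worker` substring in the thread name is an assumption of the sketch.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper; the thread-name check is an assumption of this sketch.
final class ThreadChecks {
    static final String TRANSPORT_WORKER_PREFIX = "transport_worker";

    static boolean isTransportWorker(Thread t) {
        return t.getName().contains(TRANSPORT_WORKER_PREFIX);
    }

    // Returns true so callers can write `assert ThreadChecks.assertNotTransportWorker(...)`,
    // which makes the whole check disappear when assertions are disabled.
    static boolean assertNotTransportWorker(String reason) {
        Thread t = Thread.currentThread();
        assert isTransportWorker(t) == false : "must not run on [" + t.getName() + "]: " + reason;
        return true;
    }
}

// Hypothetical store whose close path is guarded by the assertion.
final class GuardedStore {
    void closeInternal() {
        assert ThreadChecks.assertNotTransportWorker("store close does IO and takes the shard lock");
        // ... release files and the shard lock here ...
    }
}

final class GuardDemo {
    // Runs the close on a thread with the given name; returns what it threw, or null.
    static Throwable closeOn(String threadName) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread t = new Thread(() -> {
            try {
                new GuardedStore().closeInternal();
            } catch (Throwable e) {
                failure.set(e);
            }
        }, threadName);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return failure.get();
    }

    public static void main(String[] args) {
        System.out.println("generic thread: " + closeOn("generic[sketch]"));
        System.out.println("transport worker: " + closeOn("transport_worker[sketch]"));
    }
}
```

Java assertions are off by default, so the guard only fires when the JVM is started with `-ea`. That fits the intent here: catch the bad call path in tests without adding a cost or a failure mode in production.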