Freeing Scroll Context can Result in the Store Getting Closed on a Transport Thread

original-brownbear commented 2 years ago

I looked into transport worker slow logging in 7.17, and one transport action stands out as a recurring issue, with logs like the one below:

[instance-0000000000] handling inbound transport message [InboundMessage{Header{1325}{7.17.0}{241879}{true}{false}{false}{false}{indices:data/read/search[free_context/scroll]}}] took [6004ms] which is above the warn threshold of [5000ms]

I believe this is caused by the fact that the underlying action decrements the store ref count. If it turns out to be the last to decrement the ref count here, then the closing of the store (including acquiring the shard lock) runs on a transport thread.
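
For illustration, here is a minimal, standalone sketch of how a last `decRef` can pull the close logic onto whichever thread happens to release the final reference. It is not the actual Elasticsearch `Store` code: the `RefCountedResource` class, its methods, and the thread name `transport_worker-demo` are all made up for this example, and the real close path (shard lock, IO) is reduced to recording the thread name.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a ref-counted store; NOT the real Elasticsearch Store class.
class RefCountedResource {
    // Starts at 1: the owner (think: the shard) holds the initial reference.
    private final AtomicInteger refCount = new AtomicInteger(1);
    // Records the name of the thread that ended up running the close logic.
    private final AtomicReference<String> closedOn = new AtomicReference<>();

    // Simplified: a real implementation would refuse to increment from zero.
    void incRef() {
        refCount.incrementAndGet();
    }

    // Whoever drops the count to zero runs the close logic on its own thread.
    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            closeInternal();
        }
    }

    // Placeholder for the expensive part (lock acquisition, IO).
    private void closeInternal() {
        closedOn.set(Thread.currentThread().getName());
    }

    String closedOn() {
        return closedOn.get();
    }

    // Simulates: a scroll context holds a reference, the shard goes away first
    // (e.g. a relocation), then the context is freed on a "transport" thread.
    static String demo() {
        RefCountedResource store = new RefCountedResource();
        store.incRef(); // reference held by the scroll context
        store.decRef(); // owner releases its reference; count is now 1
        Thread transport = new Thread(store::decRef, "transport_worker-demo");
        transport.start();
        try {
            transport.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return store.closedOn();
    }

    public static void main(String[] args) {
        System.out.println("closed on: " + demo());
    }
}
```

In this sketch the close runs on the thread that frees the context, which is the pattern described above: the free-context handler is only the last holder by accident of timing, yet it pays for the close.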

I think this can only happen if there's a concurrent relocation or similar (though it happens quite a bit in Cloud logs), but regardless, IO should never run on transport workers.

I wonder if we may have other spots where this occurs and the last decrement for the store happens via a search action on a transport thread. It might be worth adding an assertion for not running the store close on a transport worker when fixing this.
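
As a rough idea of what such an assertion could look like, here is a standalone sketch that checks the current thread's name. The `ThreadAssertions` class, its method names, and the `transport_worker` name marker are assumptions for illustration only; the actual fix may identify transport threads differently.

```java
// Hypothetical helper; the class name, method names and the "transport_worker"
// marker are assumptions for illustration, not an existing Elasticsearch API.
final class ThreadAssertions {
    // Assumed substring that identifies transport worker threads by name.
    static final String TRANSPORT_WORKER_MARKER = "transport_worker";

    private ThreadAssertions() {}

    // True if the given thread's name marks it as a transport worker.
    static boolean isTransportThread(Thread thread) {
        return thread.getName().contains(TRANSPORT_WORKER_MARKER);
    }

    // Returns true so callers can write `assert assertNotTransportThread("...")`,
    // which makes the whole check disappear when assertions are disabled.
    static boolean assertNotTransportThread(String reason) {
        Thread current = Thread.currentThread();
        assert isTransportThread(current) == false
            : "Expected current thread [" + current.getName()
                + "] to not be a transport thread. Reason: [" + reason + "]";
        return true;
    }
}
```

A store close path could then start with `assert ThreadAssertions.assertNotTransportThread("closing store");`, so tests running with `-ea` would trip as soon as a last decrement lands on a transport worker.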

elasticmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

elasticmachine commented 2 years ago

Pinging @elastic/es-distributed (Team:Distributed)

benwtrent commented 2 months ago

@original-brownbear does this bug still exist? I know we have made progress on moving & detecting things that could block transport thread pools.

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)

original-brownbear commented 2 months ago

@benwtrent I fixed most but not all cases. There are still some corner cases left around waiting for refresh and one other thing, but let me try to fix this now. I might have a short fix.

elastic / elasticsearch

Freeing Scroll Context can Result in the Store Getting Closed on a Transport Thread #83515