Open Zeyu-Chen-SFDC opened 4 days ago
Here is the flamegraph from periodic jstack captures of the broker query thread:
did this query override the timeout by any chance?
As far as I could ascertain, no. I examined the corresponding SqlQueryPlus.queryContext
object in the heap and did not find any timeout override.
QueryRunner and QueryScheduler objects from the heap dump:
@Zeyu-Chen-SFDC Thanks for the detailed report! Does this issue begin to happen when the broker is serving a lot of queries, and there are some that are timing out? Also, can you search for the query ID in the logs of the broker and of the historical serving the queries, and share anything relevant?
Does this issue begin to happen when the broker is serving a lot of queries, and there are some that are timing out?
We haven't seen that. All these episodes began when the broker was lightly loaded, serving at most 1 other "normal" query concurrently, and the "normal" queries completed successfully. The impact of the stuck joins is as if they simply reduced the Jetty thread pool capacity by a constant.
search for the query ID in the logs of the broker and the historical serving the queries and please share if there's something relevant.
The pattern of activities from the logs is as follows:
INFO [sql[26da5c75-0f2e-4cbc-a521-34de6ec17638]] org.apache.druid.offheap.OffheapBufferGenerator - Allocating new intermediate processing buffer[4] of size[524,288,000]
org.apache.druid.server.log.LoggingRequestLogger - 2024-09-20T00:13:47.028Z 127.0.0.1 {"success":false,"exception":"Total timeout 900000 ms elapsed"} {"query":"SELECT ....
org.apache.druid.server.AsyncQueryForwardingServlet - Exception handling request: {exceptionType=java.util.concurrent.TimeoutException, exceptionMessage=Total timeout 900000 ms elapsed, class=org.apache.druid.server.AsyncQueryForwardingServlet, exception=java.util.concurrent.TimeoutException: Total timeout 900000 ms elapsed, sqlQuery=SqlQuery{query='SELECT distinct ...
org.apache.druid.server.log.LoggingRequestLogger - 2024-09-20T00:13:47.029Z 127.0.0.1 {"query/time":900001,"success":false} {"query":"SELECT distinct ...
- no more logs associated with the queryid are seen beyond this point
Long running join query threads on brokers cannot be cancelled or interrupted
Affected Version
28.0.1
Description
Poorly written join queries are seen busy-looping in PostJoinCursor.advanceToMatch() on the broker's Jetty threads. These queries have been running for days. While we have separate efforts to address the queries, we want to release all resources held up on the broker by these joins. When query cancellation is attempted with
curl -XDELETE 127.0.0.1:8088/druid/v2/sql/<QID>
on the broker, a 404 response is returned, and the query thread on the broker continues as before.
Here are some examined internal states of the broker:
- The SqlLifecycleManager object contains the query ID being cancelled.
- The QueryScheduler.queryFutures object does not contain any future under the subject query ID.
- druid.server.http.defaultQueryTimeout=600000
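
For context on why a busy-looping thread can outlive a cancellation attempt, here is a minimal, self-contained Java sketch. It is not Druid code and does not claim to diagnose this issue: the class InterruptSketch and the methods scan and stopsOnInterrupt are hypothetical names made up for illustration, and the loop body is only a placeholder for per-row work. It shows the general JVM behavior that Thread.interrupt() merely sets a flag, so a CPU-bound loop stops only if it polls that flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptSketch {

    /**
     * Hypothetical stand-in for a cursor scanning rows that never match.
     * The escape hatch exists only so the demo itself can terminate.
     */
    static void scan(boolean checkInterrupt, AtomicBoolean escapeHatch) {
        while (!escapeHatch.get()) {
            // Cooperative variant: poll the interrupt flag on every iteration.
            if (checkInterrupt && Thread.currentThread().isInterrupted()) {
                return;
            }
            Thread.onSpinWait(); // placeholder for per-row work
        }
    }

    /** Starts the scan, interrupts it, and reports whether it stopped within waitMillis. */
    static boolean stopsOnInterrupt(boolean checkInterrupt, long waitMillis)
            throws InterruptedException {
        AtomicBoolean escapeHatch = new AtomicBoolean(false);
        Thread worker = new Thread(() -> scan(checkInterrupt, escapeHatch));
        worker.setDaemon(true);
        worker.start();

        worker.interrupt();      // only sets the thread's interrupt flag
        worker.join(waitMillis); // wait a bounded time for the loop to exit
        boolean stopped = !worker.isAlive();

        escapeHatch.set(true);   // release the uncooperative loop so the demo ends
        worker.join();
        return stopped;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("loop without flag check stopped: " + stopsOnInterrupt(false, 200));
        System.out.println("loop with flag check stopped: " + stopsOnInterrupt(true, 2000));
    }
}
```

Whether PostJoinCursor.advanceToMatch() polls for interruption in 28.0.1 is something I have not verified; the sketch only illustrates that an interrupt has no effect on a loop that never checks for it.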