apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Join query on broker becomes uncancellable #17163

Open Zeyu-Chen-SFDC opened 4 days ago

Zeyu-Chen-SFDC commented 4 days ago

Long-running join query threads on brokers cannot be cancelled or interrupted.

Affected Version

28.0.1

Description

Poorly written join queries have been observed busy-looping in PostJoinCursor.advanceToMatch() on the broker's Jetty threads. These queries have been running for days. While we have separate efforts to address the queries themselves, we want to release all resources they hold on the broker. When we attempt to cancel a query with curl -XDELETE 127.0.0.1:8088/druid/v2/sql/<QID> on the broker, a 404 response is returned and the query thread on the broker continues as before.
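For anyone reproducing the cancellation attempt programmatically, the curl command above corresponds roughly to the Java sketch below. It is an illustration only: the class and method names are hypothetical, the host and port are copied from the curl command, and the status-code descriptions (202 for an accepted cancellation, 404 when the broker does not recognise the SQL query ID) are assumptions based on the Druid SQL API documentation and should be checked against the version in use.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CancelSqlQuery {
    // Builds the DELETE request equivalent to:
    //   curl -XDELETE <brokerBaseUrl>/druid/v2/sql/<sqlQueryId>
    public static HttpRequest buildCancelRequest(String brokerBaseUrl, String sqlQueryId) {
        return HttpRequest.newBuilder()
                .uri(URI.create(brokerBaseUrl + "/druid/v2/sql/" + sqlQueryId))
                .DELETE()
                .build();
    }

    // Assumed meaning of the status codes; verify against the Druid docs.
    public static String describe(int status) {
        switch (status) {
            case 202:
                return "cancellation accepted";
            case 404:
                return "SQL query ID not known to this broker";
            default:
                return "unexpected status " + status;
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder ID; pass the real SQL query ID as the first argument.
        String sqlQueryId = args.length > 0 ? args[0] : "example-query-id";
        HttpRequest request = buildCancelRequest("http://127.0.0.1:8088", sqlQueryId);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + describe(response.statusCode()));
    }
}
```

The request building is kept separate from the network call so the URL and method can be checked without a running broker.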

Here are some internal states of the broker that we examined:

Zeyu-Chen-SFDC commented 4 days ago

Here is the flamegraph from periodic jstack captures of the broker query thread:

[Image: flame graph]

abhishekagarwal87 commented 4 days ago

did this query override the timeout by any chance?

Zeyu-Chen-SFDC commented 4 days ago

did this query override the timeout by any chance?

As far as I could ascertain, no. I examined the corresponding SqlQueryPlus.queryContext object in the heap and did not find any timeout override.

Zeyu-Chen-SFDC commented 4 days ago

QueryRunner and QueryScheduler objects from the heap dump:

[Image: Screenshot 2024-09-25 at 9 49 26 PM]

[Image: Screenshot 2024-09-25 at 9 51 04 PM]

LakshSingla commented 4 days ago

@Zeyu-Chen-SFDC Thanks for the detailed report! Does this issue begin to happen when the broker is serving a lot of queries, and there are some that are timing out? Also, can you search for the query ID in the logs of the broker and the historical serving the queries and please share if there's something relevant?

Zeyu-Chen-SFDC commented 4 days ago

Does this issue begin to happen when the broker is serving a lot of queries, and there are some that are timing out?

We haven't seen that. All these episodes began when the broker was lightly loaded, serving at most one other "normal" query concurrently, and the "normal" queries completed successfully. The impact of the stuck joins is as if they simply reduced the Jetty thread pool capacity by a constant.

search for the query ID in the logs of the broker and the historical serving the queries and please share if there's something relevant.

The pattern of activities from the logs is as follows:

org.apache.druid.server.AsyncQueryForwardingServlet - Exception handling request: {exceptionType=java.util.concurrent.TimeoutException, exceptionMessage=Total timeout 900000 ms elapsed, class=org.apache.druid.server.AsyncQueryForwardingServlet, exception=java.util.concurrent.TimeoutException: Total timeout 900000 ms elapsed, sqlQuery=SqlQuery{query='SELECT distinct ...

org.apache.druid.server.log.LoggingRequestLogger - 2024-09-20T00:13:47.029Z 127.0.0.1 {"query/time":900001,"success":false} {"query":"SELECT distinct ...


- No more logs associated with the query ID are seen beyond this point.
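The logs show the request timing out after 900000 ms while, per the description, the query thread keeps running. As a general illustration of how that combination can arise in Java, and not a confirmed diagnosis of Druid's behaviour, the self-contained sketch below uses hypothetical names and a stand-in busy loop. It shows that a timeout followed by Future.cancel(true) only sets the worker's interrupt flag, so a tight loop stops only if it polls that flag.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptibleLoopSketch {
    // Hypothetical stand-in for a tight "advance until match" loop.
    // With checkInterrupt == false the loop never observes cancellation.
    public static Runnable busyLoop(boolean checkInterrupt, AtomicBoolean stop) {
        return () -> {
            // 'stop' is only a safety valve so this demo can always exit.
            while (!stop.get()) {
                if (checkInterrupt && Thread.currentThread().isInterrupted()) {
                    return; // cooperative cancellation point
                }
            }
        };
    }

    // Runs the loop, times out waiting for it, cancels it with an interrupt,
    // and reports whether the worker thread actually stopped shortly after.
    public static boolean stopsAfterCancel(boolean checkInterrupt) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicBoolean stop = new AtomicBoolean(false);
        Future<?> future = pool.submit(busyLoop(checkInterrupt, stop));
        try {
            future.get(100, TimeUnit.MILLISECONDS); // caller-side timeout
        } catch (TimeoutException e) {
            future.cancel(true); // sets the interrupt flag; does not kill the thread
        }
        pool.shutdown();
        boolean stopped = pool.awaitTermination(500, TimeUnit.MILLISECONDS);
        stop.set(true); // release the worker if it ignored the interrupt
        pool.awaitTermination(1, TimeUnit.SECONDS);
        return stopped;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("loop polling the interrupt flag stopped: " + stopsAfterCancel(true));
        System.out.println("loop ignoring the interrupt flag stopped: " + stopsAfterCancel(false));
    }
}
```

In this sketch the caller sees the timeout in both cases; only the variant that polls the interrupt flag frees its thread, which matches the symptom of a pool that loses capacity without further log output.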