apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Calls to /druid/v2/sql intermittently failing from Management UI #14965

Open jakcodex opened 10 months ago

jakcodex commented 10 months ago

Affected Version

27.0.0

Description

Apache Druid is deployed as a highly available cluster in AWS on the Amazon Linux 2023 x86_64 AMI. It consists of a three-node master-with-zk cluster and an assortment of data and query nodes, with MySQL-based metadata storage and S3 deep storage. The cluster presently has no segments or configured tasks.

The Management UI running on a 27.0.0 server experiences intermittent failures for requests to /druid/v2/sql.

The Services page makes a request with the following payload:

{
    "query": "SELECT\n  \"server\" AS \"service\",\n  \"server_type\" AS \"service_type\",\n  \"tier\",\n  \"host\",\n  \"plaintext_port\",\n  \"tls_port\",\n  \"curr_size\",\n  \"max_size\",\n  \"is_leader\",\n  \"start_time\"\nFROM sys.servers\nORDER BY\n  (\n    CASE \"server_type\"\n    WHEN 'coordinator' THEN 8\n    WHEN 'overlord' THEN 7\n    WHEN 'router' THEN 6\n    WHEN 'broker' THEN 5\n    WHEN 'historical' THEN 4\n    WHEN 'indexer' THEN 3\n    WHEN 'middle_manager' THEN 2\n    WHEN 'peon' THEN 1\n    ELSE 0\n    END\n  ) DESC,\n  \"service\" DESC"
}

This request sometimes gets the error response:

400 Bad Request

{
    "error": "Plan validation failed",
    "errorMessage": "org.apache.calcite.runtime.CalciteContextException: From line 11, column 3 to line 11, column 14: Column 'start_time' not found in any table",
    "errorClass": "org.apache.calcite.tools.ValidationException",
    "host": null
}

The Unified Console home page makes the following request:

{
    "query": "SELECT\n  COUNT(*) AS \"active\",\n  COUNT(*) FILTER (WHERE is_available = 1) AS \"cached_on_historical\",\n  COUNT(*) FILTER (WHERE is_available = 0 AND replication_factor > 0) AS \"unavailable\",\n  COUNT(*) FILTER (WHERE is_realtime = 1) AS \"realtime\"\nFROM sys.segments\nWHERE is_active = 1"
}

This request sometimes gets the error response:

400 Bad Request

{
    "error": "Plan validation failed",
    "errorMessage": "org.apache.calcite.runtime.CalciteContextException: From line 4, column 47 to line 4, column 64: Column 'replication_factor' not found in any table",
    "errorClass": "org.apache.calcite.tools.ValidationException",
    "host": null
}

Strangely, when I watch the automatic refresh requests on the Services page, the error consistently occurs on exactly every other request.
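One possible reading of the every-other-request pattern is that the router is alternating between two brokers, only one of which knows the new column. That is an assumption, not something confirmed here. A minimal sketch for testing it is to send a small probe query to each broker directly, bypassing the router. The broker URLs in the usage comment are hypothetical placeholders, and `PROBE_SQL`, `post_sql`, `probe_brokers`, and `failing_brokers` are names made up for this example; only the /druid/v2/sql endpoint and the `start_time` column come from the report above.

```python
import json
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Probe one of the columns that the failing Services page query selects.
PROBE_SQL = 'SELECT "start_time" FROM sys.servers LIMIT 1'


def post_sql(base_url, sql, timeout=10):
    """POST a SQL query to <base_url>/druid/v2/sql; return (status, parsed body)."""
    req = Request(
        base_url.rstrip("/") + "/druid/v2/sql",
        data=json.dumps({"query": sql}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status, json.load(resp)
    except HTTPError as err:
        raw = err.read().decode("utf-8", errors="replace")
        try:
            return err.code, json.loads(raw)
        except ValueError:
            # Error body was not JSON; keep the raw text.
            return err.code, raw


def probe_brokers(broker_urls, sql=PROBE_SQL, post=post_sql):
    """Map each broker URL to "ok" or to the error message it returned."""
    results = {}
    for base in broker_urls:
        status, body = post(base, sql)
        if status == 200:
            results[base] = "ok"
        elif isinstance(body, dict):
            results[base] = body.get("errorMessage", f"HTTP {status}")
        else:
            results[base] = f"HTTP {status}"
    return results


def failing_brokers(results):
    """Return the broker URLs whose probe did not succeed, sorted."""
    return sorted(url for url, outcome in results.items() if outcome != "ok")


# Example usage (hypothetical hosts; 8082 is assumed as the broker port):
# results = probe_brokers(["http://broker-1:8082", "http://broker-2:8082"])
# print(failing_brokers(results))
```

If exactly one broker shows up in the failing list, the alternation on the Services page would be explained by request routing rather than by anything in the console itself.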


The errors are also logged in the corresponding broker's log file:

2023-09-12T07:00:42,096 WARN [sql[d2038c2c-bf10-4305-93e2-679badd03e51]] org.apache.druid.sql.http.SqlResource - Exception while processing sqlQueryId[d2038c2c-bf10-4305-93e2-679badd03e51] (SqlPlanningException{msg=org.apache.calcite.runtime.CalciteContextException: From line 11, column 3 to line 11, column 14: Column 'start_time' not found in any table, code=Plan validation failed, class=org.apache.calcite.tools.ValidationException, host=null})
2023-09-12T06:57:32,797 WARN [sql[0317f126-692e-4a5e-ad76-ad422a63891a]] org.apache.druid.sql.http.SqlResource - Exception while processing sqlQueryId[0317f126-692e-4a5e-ad76-ad422a63891a] (SqlPlanningException{msg=org.apache.calcite.runtime.CalciteContextException: From line 4, column 47 to line 4, column 64: Column 'replication_factor' not found in any table, code=Plan validation failed, class=org.apache.calcite.tools.ValidationException, host=null})

With all cluster nodes running 26.0.0, the issue is not present. With all cluster nodes upgraded to 27.0.0, the issue is present. If a query/router node is then downgraded to 26.0.0, the issue no longer occurs on that specific node.

abhishekagarwal87 commented 10 months ago

These columns were added recently. It is possible that one of the broker nodes is running an older version despite the upgrade. Do confirm that the broker throwing these errors is indeed running on 27.
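One way to run that check is to ask every broker for its version over HTTP. The sketch below assumes each Druid process exposes a `/status` endpoint whose JSON response includes a `version` field; the broker URLs in the usage comment are hypothetical, and `fetch_json`, `broker_versions`, and `mismatched` are helper names invented for this example.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=5):
    """GET a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def broker_versions(broker_urls, fetch=fetch_json):
    """Map each broker base URL to the version reported by its /status endpoint."""
    versions = {}
    for base in broker_urls:
        try:
            versions[base] = fetch(base.rstrip("/") + "/status").get("version")
        except OSError as exc:
            # URLError and socket timeouts are OSError subclasses.
            versions[base] = f"unreachable: {exc}"
    return versions


def mismatched(versions, expected_prefix="27."):
    """Return the broker URLs whose reported version lacks the expected prefix."""
    return sorted(
        url for url, version in versions.items()
        if not str(version).startswith(expected_prefix)
    )


# Example usage (hypothetical hosts; 8082 is assumed as the broker port):
# versions = broker_versions(["http://broker-1:8082", "http://broker-2:8082"])
# print(versions, mismatched(versions))
```

Any broker listed by `mismatched` is either unreachable or not reporting a 27.x version, and would be the first place to look.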

github-actions[bot] commented 2 weeks ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.