apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0

Data preview for iceberg partitioned tables (using trino) does not work. #26449

Closed amir-bashir closed 2 months ago

amir-bashir commented 10 months ago


In SQL Lab, previewing Iceberg partitioned tables through the Trino connector fails. The preview runs in three steps: 1 - Superset reads the partition data to show it below the table list. 2 - It then shows all columns and their data types. 3 - In the last step it executes a Trino query to fetch 100 rows for the preview.

This last step fails in my case. Superset takes the record_count, file_count, total_size and data fields from the partition metadata and adds these four columns to the WHERE clause of the Trino query. Since these fields are not columns of the table, Trino throws the error shown in the picture below.
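
The failure described above can be illustrated with a minimal sketch. This is not Superset's actual code: `build_preview_query` and `unresolvable_columns` are hypothetical helpers, and the table columns and partition values are made up. It only shows how filtering on fields taken from the partition metadata produces a query that references columns the table does not have.

```python
def build_preview_query(table, columns, partition_fields, latest_values, limit=100):
    """Hypothetical helper: preview SELECT filtered on the 'latest partition'."""
    select_list = ",\n  ".join(f'"{c}"' for c in columns)
    # One equality predicate per partition field, using its latest value.
    predicates = [
        f"\"{field}\" = '{value}'"
        for field, value in zip(partition_fields, latest_values)
    ]
    query = f'SELECT\n  {select_list}\nFROM "{table}"'
    if predicates:
        query += "\nWHERE " + "\n  AND ".join(predicates)
    return f"{query}\nLIMIT {limit}"


def unresolvable_columns(table_columns, predicate_columns):
    """Return the predicate columns that do not exist in the table."""
    known = set(table_columns)
    return [c for c in predicate_columns if c not in known]


# Assumed example data: real columns of the table vs. the fields that the
# Iceberg partition metadata exposes (names taken from the report above).
table_columns = ["id", "name", "city"]
metadata_fields = ["record_count", "file_count", "total_size", "data"]
latest_values = [10, 1, 2048, "{}"]  # placeholder values

query = build_preview_query("schools", table_columns, metadata_fields, latest_values)
print(query)
print(unresolvable_columns(table_columns, metadata_fields))
```

Every field in the WHERE clause of the generated query is reported as unresolvable, which matches the "Column 'record_count' cannot be resolved" error from Trino.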

How to reproduce the bug

  1. Create an iceberg partitioned table in trino
  2. Open SQL Lab
  3. Select catalog, schema and table from the drop downs.
  4. You will see an error "trino error: line 5:7: Column 'record_count' cannot be resolved"

Expected results

The preview should run properly and display the data preview

Actual results

The preview fails with the following error.

image

Superset logs are:

Triggering query_id: 44
2024-01-10 13:13:55,636:INFO:superset.sqllab.commands.execute:Triggering query_id: 44
Query 44: Executing 1 statement(s)
2024-01-10 13:13:55,669:INFO:superset.sql_lab:Query 44: Executing 1 statement(s)
Query 44: Set query to 'running'
2024-01-10 13:13:55,669:INFO:superset.sql_lab:Query 44: Set query to 'running'
2024-01-10 13:13:55,752:DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): trino-dev2.digixt.ae:443
2024-01-10 13:13:55,939:DEBUG:urllib3.connectionpool:https://trino-dev2.digixt.ae:443 "POST /v1/statement HTTP/1.1" 200 328
2024-01-10 13:13:55,948:DEBUG:urllib3.connectionpool:https://trino-dev2.digixt.ae:443 "GET /v1/statement/queued/20240110_131355_04768_yj9fy/y61c87885b3716c3aeb602e582689e4bda331199b/1 HTTP/1.1" 200 328
2024-01-10 13:13:55,956:DEBUG:urllib3.connectionpool:https://trino-dev2.digixt.ae:443 "GET /v1/statement/queued/20240110_131355_04768_yj9fy/y845f4804062c99e367543efad25e818547c46f3b/2 HTTP/1.1" 200 337
2024-01-10 13:13:55,964:DEBUG:urllib3.connectionpool:https://trino-dev2.digixt.ae:443 "GET /v1/statement/executing/20240110_131355_04768_yj9fy/ya39456ccf05d76175d3ad158e116c2dc82d675ab/0 HTTP/1.1" 200 535
2024-01-10 13:13:56,042:DEBUG:urllib3.connectionpool:https://trino-dev2.digixt.ae:443 "GET /v1/statement/executing/20240110_131355_04768_yj9fy/ybca09ab00285e284f9ac19efa111537387f9f501/1 HTTP/1.1" 200 447
Query 44: Running statement 1 out of 1
2024-01-10 13:13:56,045:INFO:superset.sql_lab:Query 44: Running statement 1 out of 1
2024-01-10 13:13:56,165:DEBUG:urllib3.connectionpool:https://trino-dev2.digixt.ae:443 "POST /v1/statement HTTP/1.1" 200 327
2024-01-10 13:13:56,243:DEBUG:urllib3.connectionpool:https://trino-dev2.digixt.ae:443 "GET /v1/statement/queued/20240110_131356_04769_yj9fy/y18debc4fc194e74fbff9c77d6936c3c0ef40b10f/1 HTTP/1.1" 200 327
2024-01-10 13:13:56,263:DEBUG:urllib3.connectionpool:https://trino-dev2.digixt.ae:443 "GET /v1/statement/queued/20240110_131356_04769_yj9fy/ybd14d8bd60657510e8663226e660c6f4aa3223b7/2 HTTP/1.1" 200 1242
SupersetErrorsException
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/local/lib/python3.9/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/views/base_api.py", line 127, in wraps
    raise ex
  File "/app/superset/views/base_api.py", line 121, in wraps
    duration, response = time_function(f, self, *args, **kwargs)
  File "/app/superset/utils/core.py", line 1526, in time_function
    response = func(*args, **kwargs)
  File "/app/superset/views/base_api.py", line 93, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/utils/log.py", line 255, in wrapper
    value = f(*args, **kwargs)
  File "/app/superset/sqllab/api.py", line 310, in execute_sql_query
    command_result: CommandResult = command.run()
  File "/app/superset/sqllab/commands/execute.py", line 121, in run
    raise ex
  File "/app/superset/sqllab/commands/execute.py", line 103, in run
    status = self._run_sql_json_exec_from_scratch()
  File "/app/superset/sqllab/commands/execute.py", line 161, in _run_sql_json_exec_from_scratch
    raise ex
  File "/app/superset/sqllab/commands/execute.py", line 156, in _run_sql_json_exec_from_scratch
    return self._sql_json_executor.execute(
  File "/app/superset/sqllab/sql_json_executer.py", line 111, in execute
    raise SupersetErrorsException(
superset.exceptions.SupersetErrorsException: [SupersetError(message="trino error: line 5:7: Column 'record_count' cannot be resolved", error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'Trino', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
2024-01-10 13:13:56,778:WARNING:superset.views.base:SupersetErrorsException
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/local/lib/python3.9/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/views/base_api.py", line 127, in wraps
    raise ex
  File "/app/superset/views/base_api.py", line 121, in wraps
    duration, response = time_function(f, self, *args, **kwargs)
  File "/app/superset/utils/core.py", line 1526, in time_function
    response = func(*args, **kwargs)
  File "/app/superset/views/base_api.py", line 93, in wraps
    return f(self, *args, **kwargs)
  File "/app/superset/utils/log.py", line 255, in wrapper
    value = f(*args, **kwargs)
  File "/app/superset/sqllab/api.py", line 310, in execute_sql_query
    command_result: CommandResult = command.run()
  File "/app/superset/sqllab/commands/execute.py", line 121, in run
    raise ex
  File "/app/superset/sqllab/commands/execute.py", line 103, in run
    status = self._run_sql_json_exec_from_scratch()
  File "/app/superset/sqllab/commands/execute.py", line 161, in _run_sql_json_exec_from_scratch
    raise ex
  File "/app/superset/sqllab/commands/execute.py", line 156, in _run_sql_json_exec_from_scratch
    return self._sql_json_executor.execute(
  File "/app/superset/sqllab/sql_json_executer.py", line 111, in execute
    raise SupersetErrorsException(
superset.exceptions.SupersetErrorsException: [SupersetError(message="trino error: line 5:7: Column 'record_count' cannot be resolved", error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'Trino', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]

Additional context

On the left side, under the table name (here, schools), Superset shows the latest partition data. It then uses this information to build a SELECT query, which I copied with the copy button and pasted into the query pad.

jkleinkauff commented 8 months ago

Same here. This seems to be the same as https://github.com/apache/superset/issues/25307. I see both errors happening: "partition cannot be resolved" and "column record_count cannot be resolved".

anandnalya commented 2 months ago

I was able to get this working with the following patch, which disables partitioning support for Iceberg:

--- a/site-packages/superset/db_engine_specs/trino.py
+++ b/site-packages/superset/db_engine_specs/trino.py
@@ -445,6 +445,13 @@
         :returns: The indexes
         """
         try:
-            return super().get_indexes(database, inspector, table_name, schema)
+            indexes = super().get_indexes(database, inspector, table_name, schema)
+            # Handle iceberg tables. Even for non-partitioned tables, it returns a value
+            iceberg_cols_ignore = {"record_count", "file_count", "total_size", "data"}
+            if len(indexes) == 1 and indexes[0].get(
+                "name") == "partition" and iceberg_cols_ignore.issubset(
+                set(indexes[0].get("column_names", []))):
+                return []
+            return indexes
         except NoSuchTableError:
             return []
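
The check added by the patch can also be read as a standalone function, which makes it easier to try out without a running Superset. This is only a sketch of the same condition: `drop_iceberg_partition_index` is a hypothetical name, and the sample index dictionaries are assumptions about what the inspector returns (a list of dicts with `name` and `column_names` keys, as the patch implies).

```python
# Fields that the patch treats as Iceberg partition metadata, not table columns.
ICEBERG_COLS_IGNORE = {"record_count", "file_count", "total_size", "data"}


def drop_iceberg_partition_index(indexes):
    """Return [] when the only index looks like Iceberg partition metadata."""
    if (
        len(indexes) == 1
        and indexes[0].get("name") == "partition"
        and ICEBERG_COLS_IGNORE.issubset(indexes[0].get("column_names", []))
    ):
        return []
    return indexes


# Assumed shape of the index reported for an Iceberg table.
iceberg_indexes = [
    {
        "name": "partition",
        "column_names": ["partition", "record_count", "file_count", "total_size", "data"],
    }
]
# Assumed shape of the index reported for a table partitioned on a real column.
other_indexes = [{"name": "partition", "column_names": ["ds"]}]

print(drop_iceberg_partition_index(iceberg_indexes))
print(drop_iceberg_partition_index(other_indexes))
```

The first call returns an empty list, so no partition filter would be built for the Iceberg table. The second index is returned unchanged.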

rusackas commented 2 months ago

Closing this in favor of https://github.com/apache/superset/issues/25307... but if you think the above diff fixes the issue, maybe it can be generalized a bit to (safely) address the issue on Iceberg and/or other data sources having similar issues?
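
One possible direction for such a generalization, sketched under assumptions rather than taken from Superset: instead of hard-coding the Iceberg field names, keep a partition index only when every column it lists is a real column of the table. `keep_indexes_on_table_columns` is a hypothetical helper, and the input shapes (a list of index dicts plus a list of table column names) are assumed.

```python
def keep_indexes_on_table_columns(indexes, table_column_names):
    """Keep only indexes whose columns all exist in the table itself."""
    known = set(table_column_names)
    kept = []
    for index in indexes:
        columns = index.get("column_names", [])
        # Drop indexes with no columns or with any column the table lacks.
        if columns and all(column in known for column in columns):
            kept.append(index)
    return kept


# Assumed example inputs.
table_columns = ["id", "name", "ds"]
iceberg_like = [
    {
        "name": "partition",
        "column_names": ["partition", "record_count", "file_count", "total_size", "data"],
    }
]
real_partition = [{"name": "partition", "column_names": ["ds"]}]

print(keep_indexes_on_table_columns(iceberg_like, table_columns))
print(keep_indexes_on_table_columns(real_partition, table_columns))
```

This does not depend on a specific connector, but it is only a heuristic: a table that happens to have real columns with the same names as the metadata fields would still pass the check.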