apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
60.33k stars 13.01k forks source link

Data preview for trino tables does not work for all trino connection types #25307

Open jshumway-zs opened 9 months ago

jshumway-zs commented 9 months ago

A clear and concise description of what the bug is.

Data preview for trino tables does not work for all trino connectors (specifically for iceberg in our case). This issue results because the response from "select * from table$partitions" is dependent on the connection type (this is used in superset/db_engine_specs/presto.py). It seems that the code was written based off of the response for a hive connection, but the response schema is different for other connection types (see the iceberg docs here for example).

How to reproduce the bug

  1. create an iceberg table in trino
  2. perform an action which will trigger a table preview on the table (such as selecting the table from the dropdowns in sqllab)
  3. you will see an error such as "column 'partition' cannot be resolved"

Expected results

The preview should run properly and display the data preview

Actual results

The preview fails and an error message is shown

Screenshots

image

Environment

(please complete the following information):

Additional context

I have fixed this on our end by checking if the returned partition columns match the schema described in the iceberg documentation and if so, I return None. However this doesn't allow us to see the most recent partition and is also specific to iceberg. I would recommend setting latest_partition to False in the select_star method for trino until a better solution is found, as selecting from the most recent partition isn't critical in most cases.

OmarSultan85 commented 9 months ago

Hello, I am facing the same issue, were you able to resolve this?

jshumway-zs commented 8 months ago

Hello, I am facing the same issue, were you able to resolve this?

Hey @OmarSultan85, for now we have just forked and removed the most recent partition logic as it isn't critical for us. I've found that it also affects bigquery tables accessed through trino and is still an issue in version 3.0.0 as well.

amir-bashir commented 5 months ago

@jshumway-zs Can you please share which files need to be modified and which function to be removed to make it work. In my case superset 3.0 is adding record_count, and file_count to where clause for every iceberg table. What it does is read the metadata information from partition file and then use values from this file to append as where clause.

jshumway-zs commented 5 months ago

@jshumway-zs Can you please share which files need to be modified and which function to be removed to make it work. In my case superset 3.0 is adding record_count, and file_count to where clause for every iceberg table. What it does is read the metadata information from partition file and then use values from this file to append as where clause.

Here is a snippet from our dockerfile which we are using to correct. Basically we return early in presto.py on line 524 and then remove lines 73-92 in trino.py. Obviously the line numbers will depend on the release. Its been a while since I looked at this, so I'm not sure which methods are being modified here, but if you look a the code from the 3.0.0 branch you should get an idea.

FROM apache/superset:3.0.0rc4

USER root

RUN sed -i "524i \        return None" /app/superset/db_engine_specs/presto.py
RUN sed -i 73,92d /app/superset/db_engine_specs/trino.py

USER superset

Just looked back at the code and it looks like we are exiting early from PrestoBaseEngineSpec._latest_partition_from_df and making sure metadata['partitions'] isn't getting set in TrinoEngineSpec.extra_table_metadata.

jkleinkauff commented 3 months ago

+1 we are facing the same issue here. Superset 3.1.1 and Iceberg tables in Trino.

rusackas commented 3 months ago

Pinging some users/SMEs from the rolodex on this one, since a solution might actually close TWO issues: @john-bodley @sujiplr @nytai

zspsole commented 2 weeks ago

This is still the case on latest release (4.0.1). As described by @jshumway-zs, the Trino db engine extends from the Presto one, that has the method where_latest_partition defined that the preview fuction uses to make sure you only preview data from the latest partition.

This method adds a WHERE clause using all the columns/values from the $partitions first entry. This doesn't work in Trino. If it's not fixed at least it could be removed by overriding the method to the default implementation (return None).

New offsets for the same solution to disable partitioning tracking for 4.0.1:

FROM apache/superset:4.0.1

USER root
RUN sed -i "494i \        return None" /app/superset/db_engine_specs/presto.py
RUN sed -i 81,99d /app/superset/db_engine_specs/trino.py
USER superset