apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
61.81k stars 13.53k forks source link

Presto DB regression on 3.0.0 #25450

Open zhaorui2022 opened 11 months ago

zhaorui2022 commented 11 months ago

After we upgrade to 3.0.0, we have noticed that we can't add new Presto dataset (very simple table with just a bigint column still works, but others don't) due to unable to fetch table metadata. More specifically it is api/v1/database/<int>/table/<table_name>/<schema>/ that's broken.

How to reproduce the bug

  1. Go to Datasets, and try to add any Presto table
  2. See error

Expected results

Show table columns as in previous version

Actual results

Shows an error

Screenshots

Screenshot 2023-09-28 at 12 24 59 AM

Environment

(please complete the following information):

Checklist

Make sure to follow these steps before submitting your issue - thank you!

Additional context

Add any other context about the problem here.

Here is a sample stacktrace

2023-09-28 06:00:04 
Traceback (most recent call last):
2023-09-28 06:00:04 
  File "/srv/superset/dropbox/apx/superset/superset_launcher.runfiles/__main__/thirdparty/superset_python3/pip/sqlalchemy/sqlalchemy-cpython-310/lib/sqlalchemy/sql/schema.py", line 134, in _init_items
2023-09-28 06:00:04 
    spwd = item._set_parent_with_dispatch
2023-09-28 06:00:04 
AttributeError: 'str' object has no attribute '_set_parent_with_dispatch'
2023-09-28 06:00:04 
2023-09-28 06:00:04 
The above exception was the direct cause of the following exception:
2023-09-28 06:00:04 
2023-09-28 06:00:04 
Traceback (most recent call last):
2023-09-28 06:00:04 
  File "/srv/superset/dropbox/apx/superset/superset_launcher.runfiles/__main__/thirdparty/superset_python3/superset-cpython-310/lib/superset/databases/api.py", line 728, in table_metadata
2023-09-28 06:00:04 
    table_info = get_table_metadata(database, table_name, schema_name)
2023-09-28 06:00:04 
  File "/srv/superset/dropbox/apx/superset/superset_launcher.runfiles/__main__/thirdparty/superset_python3/superset-cpython-310/lib/superset/databases/utils.py", line 95, in get_table_metadata
2023-09-28 06:00:04 
    "selectStar": database.select_star(
2023-09-28 06:00:04 
  File "/srv/superset/dropbox/apx/superset/superset_launcher.runfiles/__main__/thirdparty/superset_python3/superset-cpython-310/lib/superset/models/core.py", line 649, in select_star
2023-09-28 06:00:04 
    return self.db_engine_spec.select_star(
2023-09-28 06:00:04 
  File "/srv/superset/dropbox/apx/superset/superset_launcher.runfiles/__main__/thirdparty/superset_python3/superset-cpython-310/lib/superset/db_engine_specs/presto.py", line 1097, in select_star
2023-09-28 06:00:04 
    return super().select_star(
2023-09-28 06:00:04 
  File "/srv/superset/dropbox/apx/superset/superset_launcher.runfiles/__main__/thirdparty/superset_python3/superset-cpython-310/lib/superset/db_engine_specs/base.py", line 1383, in select_star
2023-09-28 06:00:04 
    partition_query = cls.where_latest_partition(
2023-09-28 06:00:04 
  File "/srv/superset/dropbox/apx/superset/superset_launcher.runfiles/__main__/thirdparty/superset_python3/superset-cpython-310/lib/superset/db_engine_specs/presto.py", line 518, in where_latest_partition
2023-09-28 06:00:04 
    query = query.where(Column(col_name, col_type) == value)
2023-09-28 06:00:04 
  File "/srv/superset/dropbox/apx/superset/superset_launcher.runfiles/__main__/thirdparty/superset_python3/pip/sqlalchemy/sqlalchemy-cpython-310/lib/sqlalchemy/sql/schema.py", line 1767, in __init__
2023-09-28 06:00:04 
    self._init_items(*args)
2023-09-28 06:00:04 
  File "/srv/superset/dropbox/apx/superset/superset_launcher.runfiles/__main__/thirdparty/superset_python3/pip/sqlalchemy/sqlalchemy-cpython-310/lib/sqlalchemy/sql/schema.py", line 136, in _init_items
2023-09-28 06:00:04 
    util.raise_(
2023-09-28 06:00:04 
  File "/srv/superset/dropbox/apx/superset/superset_launcher.runfiles/__main__/thirdparty/superset_python3/pip/sqlalchemy/sqlalchemy-cpython-310/lib/sqlalchemy/util/compat.py", line 211, in raise_
2023-09-28 06:00:04 
    raise exception
2023-09-28 06:00:04 
sqlalchemy.exc.ArgumentError: 'SchemaItem' object, such as a 'Column' or a 'Constraint' expected, got 'VARCHAR'

I have temporarily reverted this part of the code back to what it was in 2.0.0, and that fixes the issue. But the commit introduced the change was supposed to fix another bug, so I am not sure if it is OK to just revert the commit. Sample commit.

michael-s-molina commented 11 months ago

@giftig Can you take a look?

giftig commented 11 months ago

@zhaorui2022 can you provide the presto version you're using? I only tested the change with trino, so it might be it's broken on certain older versions, back when Trino was still Presto, and especially since we're using different client libs for the two. The types we encounter here are a little complicated depending on exact context so we might need to either do some additional edge case handling or take a broader look at why we're getting such different types here.

giftig commented 11 months ago

This will probably be of interest to you as well @villebro

zhaorui2022 commented 11 months ago

I believe it is 0.272-2AC4790

Also we found two other regressions on Presto:

  1. timestamp are no longer correctly converted. We changed https://github.com/apache/superset/blob/master/superset/db_engine_specs/presto.py#L290 back to milliseconds which was what's used in 2.0.0, and it fixed the issue.
  2. It seems show columns also no longer works on some views created in Hive with errors like View data missing prefix: SELECT(day|hour_ts)?+.+FROM, which eventually fails database table and table_extra APIs. (Edit: This seems an expected behavior due to Hive/Presto syntax incompatibilities)
giftig commented 11 months ago

0.272 is significantly after presto and trino forked, the last common version was 0.215, in 2019. We've already started trying to divide the presto logic from the trino logic, and we're using a different client library now too, so this feels like a sign we might need to think about completely decoupling them; we'll likely introduce further regressions by fixing one and breaking the other, otherwise.

If this logic is currently broken on several presto versions we probably need to fork it, drop it from PrestoDbBaseSpec and move it to the trino one instead, and either reinstate an old version to the PrestoDbSpec or start from scratch (or probably a bit of both, start from an old version which seems to work and raise further issues from there).

It's unfortunate how much logic will end up forked / duplicated this way but it's probably unavoidable since presto/trino is forked itself, especially if we're seeing issues like timestamps having different expected precisions.

rusackas commented 6 months ago

@zhaorui2022 it looks like there are some pending comments on your PR that need addressing to merge it and close out this issue. If you're able to get that one across the finish line, that would be amazing. Otherwise, if there are interested parties reading this thread that might want to pick up where the PR leaves off, that would also be fantastic!

rusackas commented 1 month ago

This issue (and related PR) might be closed out as stale before long if nobody has interest in adopting them.