duckdb / postgres_scanner

MIT License
195 stars 35 forks source link

Add PostgresViewEntry #183

Open Tishj opened 4 months ago

Tishj commented 4 months ago

In the table query we get both tables, views, materialized views, partitioned tables and foreign tables. The views are then stored in the catalog as a TableCatalogEntry, which confuses duckdb_tables() and duckdb_views() which is used inside information_schema.tables

So the views showed up as tables, this PR fixes that and adds a test for it

Notable change:

In postgres, when views are created on a dummy binding, with no aliases, their resulting alias becomes "?column?":

(lldb) p view_catalog_entry.names
(duckdb::vector<std::string>) {
  std::__1::vector<std::__1::string, std::__1::allocator<std::__1::string> > = size=1 {
    [0] = "?column?"
  }
}

As a result, our binding will fail because the names don't match. To get around this, we don't save the entry as "?column?" but instead use the name that DuckDB would attribute to the expression

Tishj commented 4 months ago

This is lacking testing still, and there are probably virtual methods that need to be overridden

Tishj commented 4 months ago

The failure is probably related to:

statement ok
CREATE VIEW s1.v1 AS SELECT 42 as j

Our error check is probably not a fan of this, because there is no existing table to reference

Or actually.. Both test_view.test and attach_views.test create or replace v1. Maybe these are somehow happening concurrently, messing with the persistent postgres database file?

Tishj commented 3 months ago

@Mytherin This seems like a problem, I can't really find the reason for this behavior

# Not aliased (no expression alias)
statement ok
CREATE VIEW s1.view_aliases.x AS SELECT 'x'

query I
select x from s1.view_aliases.x
----
{'CAST('x' AS VARCHAR)': x}

I've had to replace the CreateViewInfo names with the expression.ToString() if the name was "?column?", this seems to have this effect

Mytherin commented 3 months ago

The fact that we are getting two different names (one from Postgres', one from DuckDB') indicates to me we are binding the same query both in Postgres and in DuckDB. That seems like a more fundamental problem to me. After all - if a view contains anything that DuckDB does not support, the binding in DuckDB will fail, whereas we should still be able to query the view.

Maybe we could try e.g. not pulling the SQL definition of the view into DuckDB so the binding is avoided?

Mytherin commented 3 months ago

An example of a view that works in Postgres but wouldn't work in DuckDB might be:

create table my_table(i int);
create view my_view as select ctid from my_table;
Tishj commented 3 months ago

An example of a view that works in Postgres but wouldn't work in DuckDB might be:

create table my_table(i int);
create view my_view as select ctid from my_table;

You're right, that doesn't work But that also doesn't work for select ctid from s1.my_table, because we've never registered ctid as being part of my_table in our custom postgres catalog

That's another issue we can solve by turning the columns into separate catalog entries, we could add default entries for things like ctid. And come to think of it, rowid as well for our internal use

Tishj commented 3 months ago

The problem with this lies in Binder::Bind(BaseTableRef &ref) Even if I remove the query from the created ViewCatalogEntry, this code looks up the catalog entry, binds it and compares the new results versus the old results, to throw the "Contents of view were altered: types don't match!" error.

(also setting the query to nullptr causes an InternalException here currently because this path assumes the query is always present) Perhaps we want to just not do this verification step if the saved query is nullptr?

Tishj commented 3 months ago

The problem here is that we first bind the view's query before we hand it off to postgres, so we have to be able to bind it ourselves (possibly delegating to the postgres catalog extension)

Then we create the entry in the postgres catalog, running the same query, then reloading to get the entry as it was created in postgres. This saves it as "?column?" because that's what postgres assigned to it.

Then in our verification step inside Bind(BaseTableRef &ref) we bind the query again on duckdb's side, which produces a different name for the column.


The "skip view content verification if query is nullptr" solution could work in that case, because we won't bind the query on our side again. That still leaves the problem of the first bind though, if system columns are referenced that we haven't explicitly made a part of the TableCatalogEntry the bind will fail (as evidenced by the create view my_view as select ctid from my_table example)

One solution there might be decompose the TableCatalogEntry into a catalog set containing columns, and using a DefaultGenerator to create the system columns Another could be to always add the system columns manually in our (postgres)TableCatalogEntry creation code.

A more generic approach that will likely take the least effort is to add a virtual method on the table catalog entry for column retrieval, so we can hook in and return a column definition for "ctid" and friends, basically a poor mans "DefaultColumnGenerator"


The last idea to make this work is take a similar approach as BindCreateIndex does and let BindCreateView be a virtual method on Catalog so we can override the behavior in the PostgresCatalog Skipping the initial bind, and also the rebind that is currently done to verify that "View contents were altered" did not occur

Tishj commented 2 months ago

We're likely going to have to separate out the SelectStatement inside the CreateViewInfo, then construct a postgres query from that, get the names + types from the PGresult object, and convert the type into a LogicalType

Oid PQftype(const PGresult *res,
            int column_number);

That gets us the Oid, then we can run another query to retrieve the information we need to transform PG -> LogicalType

One downside of this is that we have to actually execute the query to get the names+types I have seen other solutions which involve creating a prepared statement and querying metadata from it, that might be interesting?