apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0
382 stars 95 forks source link

[Format] Add arbitrarily nested (metadata) schemas #320

Open lidavidm opened 1 year ago

lidavidm commented 1 year ago

ADBC's model of database metadata is limited to a catalog-schema-table hierarchy; some systems allow unlimited nesting. (See #46.)

Proposal:

For example: a table like postgres1.dbname.schemaname.table (which has 4 layers of hierarchy rather than 3) would be represented as: catalog = postgres1, db_schema = dbname.schemaname, table = table.

This proposal applies across languages. See #317 for how to manage backwards compatibility.

lidavidm commented 1 year ago

There's some ambiguity here; what should we do if a schema name (for whatever reason) contains the delimiter? Normally the driver would be returning 'unescaped' names, but now there's nothing we can do here.

A different solution might be to make the name columns list<utf8>. But this complicates the common case (unfortunately) and would be a hard breaking change (even if not on the API level, no current client would be able to make use of the new schema).

It would be good to combine this with #621 for a future 'GetObjects2'-style function. (And possibly, break up the nested schema.)

lidavidm commented 1 year ago

It also significantly complicates the filters in the API; again, do you supply the escaped name? Or all functions have to take a list, which is a bit painful

lidavidm commented 1 year ago

And hmm, what do JDBC/ODBC do here? JDBC also assumes a fixed catalog/schema/table hierarchy. So what happens for Dremio, etc.?

Dremio doesn't even implement getDbSchemas (IIRC) despite having designed the spec, so not sure if we can use it as a real life example. What does the Trino JDBC driver do?

WillAyd commented 1 year ago

I don't know that I've personally worked with a database that allows for arbitrarily nested schemas but would be curious to know how client tools handle this via ODBC/JDBC today. I'm guessing that a nested schema would be represented as schema_parent.schema_child in most if not all cases, so I think we could fit that same pattern into the GetObjects design we have today without changing much.

You would lose a "clean" understanding of the schema hierarchy, but AFAIK that is not a SQL standard anyway

lidavidm commented 1 year ago

Yup, I'm just curious about a few things like

etc.

And I wonder if these systems even expose this info in their metadata in the first place, given the main use case appears to be pulling in data from external systems.