GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0
618 stars 36 forks source link

create external table inconsistencies #2768

Open universalmind303 opened 5 months ago

universalmind303 commented 5 months ago

Description

there are some inconsistencies when using table options, especially in combination with create external table.

For many of them, they use the same object_store options, which allows for some rather inconsistent syntax

create external table test from bson
options (
  location = './external_schema.bson',
  file_type = 'bson'
)

create external table multi_report from excel 
options (
  location='./testdata/xlsx/multiple_sheets.xlsx', 
  file_type = 'xlsx', 
  sheet_name = 'other', 
  has_header = false
);

Additionally, the syntax can often be reversed for object stores (which is confusing)

create external table delta_azure_opts
from delta
options (
  location='azure://glaredb-test/ext-table*'
  account_name= 'AZURE_ACCOUNT',
  access_key='AZURE_ACCESS_KEY'
);

create external table ext_table_1 
from azure
credentials azure_creds 
options (
  location='azure://glaredb-test/ext-table*'
);

I don't think that both syntaxes should be supported, and i think we should drop support for one of them

Suggestion

for example from delta|lance|iceberg options (...) should not use the object store options, but it's own dedicated options, as they are their own storage formats, and don't integrate into object store nicely like file formats do. It also makes the distinction between the two a lot clearer.

Additionally, create external table t from <file_type> should probably never be valid. If you want to do that, you should probably be using the table function, or the inference from './path/to/file.<file_type>' instead.


For brevity, for the remainder of the text, from ... will refer to create external table t from .....

(create external table t from t -> from t)

in the above examples, I suggest they be rewritten to use from local instead of from <file_type | table_type>.

So just to recap, all File types (bson, json, csv, parquet, avro, ...) should use the object store as the from qualifier

from <local | s3 | azure | gcs> options (object_store_opts)

all storage formats should use the storage format as the from qualifier (delta, lance, iceberg)

from <delta | lance | iceberg | excel | ...> options (T_opts)

[1] I define a file format as a format that is used to represent a single table usually in a single file. (there is no notion of a table, and they are generally quite portable (csv, json, parquet, ...)).

A table format (AKA storage format) is any format that supports multiple tables (delta, lance, excel, iceberg).

tychoish commented 5 months ago

I pretty strongly believe that we should hide FROM <storage location/service> from users (because that's all information that we can (and do, in many cases) derive from the path, and using FROM exclusively for "format" rather than "location" simplifies this. The inverse (location and not format) would mean that we would either need to deduce format, or add additional syntax.