Allow creating native tables with different formats

universalmind303 commented 10 months ago

Description

With the upcoming support of lance format I'd like to be able to create native tables that use lance instead of delta

Some dialects use the USING keyword. others (clickhouse) use ENGINE =. Probably some other keywords used in other dialects as well.

create table t1 as (select 1) USING lance
create table t1 as (select 1) (ENGINE = 'lance')

-- defaults to delta if unspecified
create table t1 as (select 1)
-- is the same as
create table t1 as (select 1) USING delta

As a potential followup: I think it would be cool if we could inspect the schema of the table (maybe statistics too) to determine the optimal table representation.

For example, Lance is highly optimized for ml use cases FixedSizeList, Binary, ... while Delta is optimized for more traditional OLAP workloads.

We could likely infer if the workload is ML, or standard OLAP based off of the schema + statistics. Then we could use different backends (delta or lance) depending on the workload.

jordandakota commented 10 months ago

I don't know why but when I read this I think of an old presentation around good clean code, where the prof is quoting other leaders in industry and one of the quotes is, "a method is something that does one thing well."

I think adding additional storage options means adding additional readers and writers and adds a lot of maintenence.

universalmind303 commented 10 months ago

I think adding additional storage options means adding additional readers and writers and adds a lot of maintenence.

But the readers and writers for the given formats would already exist? I don't think end users would face any additional maintenance as it'd be an opt-in feature.

Since it is opt-in, I see this as a feature for "power users" that are aware of the tradeoffs of using different storage formats.

tychoish commented 10 months ago

Format Selectivity

Having a way to specify data format (as well as some tuning options like chunk/batch size and per-column compression formats, potentially) seems like a great capability. I think I'm vaguely partial to WITH STORAGE SETTINGS (k=v, k=v) or something like that, but negotiable on the details.

Having said that I'd characterize all of it as "column oriented" and therefore OLAP centric, in all of its form (not just delta, but datafusion itself).

Automatic Format Detection

I think it would be cool if we could inspect the schema of the table (maybe statistics too) to determine the optimal table representation.

This is definitely a feature that sounds nice, but I think would be a lot of different work. I also can't necessarily imagine the situation/user who knows the schema of the table that they're creating but doesn't know what the table's going to be used for? Also what happens when the table is going to be used for more than one thing? How should people decide? (what do we do in this situation?) Is there a situation where people might be doing both kinds of workloads over the same data set?

In a lot of ways, these data format decisions are sort of akin to "deciding which indexes to build for a table" in a OLTP database, and "which fields to index" is a thing which you can do "automatically" (though it's hard to write really good code that does that as well automatically/generically as a human would do?) In a lot of ways the whole history and design of relational databases (and maybe databases themselves) revolve around "providing users with the ability to tell the database engine how the user expects the data to be used."

The other sort of challenge with the "automatic view" operation is that it makes adding new/arbitrary data formats harder, as you have to predict the workload, potentially? That seems like a pretty high burden. I'm also aware that when you create a table, we won't, yet, have the information about the usage statistics, and we also don't have a system for collecting (and storing and retaining) that data, so there are a lot of pieces.

tychoish commented 9 months ago

I've been poking around some of this code in the last week, and I was confused for a moment. I think we should definitely call this "engine" [delta, lance] as "format" made me think about CSV/JSON/etc.

GlareDB / glaredb