apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.92k stars 1.12k forks source link

Only Allow Declaring Partition Columns in `PARTITIONED BY` Clause #9465

Closed devinjdangelo closed 5 months ago

devinjdangelo commented 6 months ago

Is your feature request related to a problem or challenge?

DataFusion implicity reorders columns in table definitions so that PARTITION BY columns are stored at the end of the underlying parquet files. This leads to very confusing behavior when selecting directly out of the parquet file as the parquet schema has a different column order than the order of the columns in the CREATE TABLE statement.

Datafusions SQL dialect declares partitioned tables like this:

CREATE EXTERNAL TABLE(partition varchar, trace_id varchar) 
STORED AS parquet
PARTITIONED BY (partition)
LOCATION '/tmp/test/';

Note that the partition column is declared with the other columns and again later in the PARTITIONED BY clause. Internally, Datafusion reorders table schemas so that partition columns come at the end, which is a common convention. This leads to confusing examples like #7892 and the following

DataFusion CLI v36.0.0
❯ create external table test(partition varchar, trace_id varchar) stored as parquet partitioned by (partition) location '/tmp/test/';
0 rows in set. Query took 0.001 seconds.

❯ insert into test values ('a','x'),('b','y'),('c','z');
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.016 seconds.

❯ select * from test;
+----------+-----------+
| trace_id | partition |
+----------+-----------+
| a        | x         |
| c        | z         |
| b        | y         |
+----------+-----------+
3 rows in set. Query took 0.002 seconds.

Since you declared the order as (partition varchar, trace_id varchar) you would expect this order to be respected when inserting data, but instead it is silently reordered so that the partition column comes at the end.

Describe the solution you'd like

Rework CREATE EXTERNAL TABLE syntax to only allow partition by columns to be declared in the partitioned by clause. The above example then becomes:

CREATE EXTERNAL TABLE(trace_id varchar) 
STORED AS parquet
PARTITIONED BY (partition varchar)
LOCATION '/tmp/test/';

This leaves much less room for confusion about the ordering of the columns when inserting values. This also follows the syntax of HiveQL, see: https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-PartitionedTables

Describe alternatives you've considered

We could instead drop the convention of moving the partitioned by columns to the end of the schema and respect the ordering of columns that the user declares.

Additional context

No response

devinjdangelo commented 6 months ago

Thanks for the interest in picking this up @Lordworms!

Since this would be a breaking change and possibly overlap/conflict with #9369, we should make sure that we have consensus on the plan before getting too deep into it. cc @alamb and @metesynnada if you have any thoughts on this.

alamb commented 6 months ago

If we can keep both the current behavior as well I think this is a good idea (aka if this is backwards compatible)

So specifically, if both the following SQL statements result in the same table schema (trace_id varchar, partition varchar)

CREATE EXTERNAL TABLE  test(partition varchar, trace_id varchar) 
STORED AS parquet
PARTITIONED BY (partition) -- no type specified here
LOCATION '/tmp/test/';
CREATE EXTERNAL TABLE test(trace_id varchar) 
STORED AS parquet
PARTITIONED BY (partition varchar)
LOCATION '/tmp/test/';
alamb commented 6 months ago

If we need to break backwards compatibility, I think it should be discussed more widely

devinjdangelo commented 6 months ago

If we need to break backwards compatibility, I think it should be discussed more widely

I think that we either should

I favor the first option and believe it would be much easier to implement. We could perhaps go for a phased approach where we deprecate the existing syntax with a warning of why it is not recommended, though maintaining both syntaxes would be more complex.

With all that said if we are against a breaking change, we could simply update documentation to increase visibility into the existing behavior and especially clarify that partition columns must be moved to the end when inserting data.

alamb commented 6 months ago

break backwards compatibility and throw an error when a column is declared as both a regular column and a partition column

This seems reasonable to me

Here is what I suggest to get some sort of consensus about this potentially breaking change:

  1. Send a note to the mailing list ("we are considering a breaking change to CREATE EXTERNAL TABLE ... please comment on the ticket if you have an opinion")
  2. Maybe also cross post to discord / slack

The idea is to get all the feedback into github but try and make sure as many people have a chance to weigh in as possible

MohamedAbdeen21 commented 6 months ago

If we can keep both the current behavior as well I think this is a good idea (aka if this is backwards compatible)

I think this is very much possible with minimal changes to the parser

devinjdangelo commented 6 months ago

I think this is very much possible with minimal changes to the parser

@MohamedAbdeen21 if you are interested in working on a PR to implement this, that would be much appreciated :pray: . I think that would be a great step to take prior to gathering feedback/consensus on making the change breaking.

MohamedAbdeen21 commented 6 months ago

I'll try to get a draft PR up before EoD