apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.37k stars 1.2k forks source link

Sqllogictests doesn't cover cases if the column name is not expected. #6349

Open alamb opened 1 year ago

alamb commented 1 year ago

Is your feature request related to a problem or challenge?

As we port our tests to sqllogictests https://github.com/apache/arrow-datafusion/issues/4460 it is important to retain test coverage

One thing that was noticed is that sqllogictest tests doesn't cover cases if the column name is not expected.

So for example, it will show the following outputs as the same

Int8
1
2

to

a
1
2

Describe the solution you'd like

It would be nice to have some way (perhaps optionally) that verified the column names as well

Describe alternatives you've considered

No response

Additional context

@comphead brought thus up on https://github.com/apache/arrow-datafusion/pull/6329#pullrequestreview-1425450768

alamb commented 1 year ago

@melgenek I wonder if you have any thoughts on this idea?

melgenek commented 1 year ago

As far as I can tell, sqllogictest in general, and sqllogictest-rs do not support column name checks right now.

There are some ways to extend sqllogictest-rs to support columns. It seems that to add native support for column names one would need to do the following: 1) add a colnames column check support to the query clause to optionally switch the check on/off, where off is the default. Column names could be the first row in the text representation

query II rowsort colnames
select 1 as one, 2 as two;
----
one two
1     2

2) update DBOutput and RecordOutput to have a field column_names. 3) update implementations of the sqllogictest::AsyncDB to return column names along with types and results 4) introduce a validator for column names similar to the type and result validator. A default implementation would probably be a no-op implementation to prevent behavior changes for library users other than Datafusion. A real implementation would likely just lowecase names and compare strings.


On the other hand, one could make comparisons work on the Datafusion side.

It seems that the EXPLAIN statement already gives the alias names for projections, so it is possible to run the same query twice: once with EXPLAIN, once without. This way both names and values are checked. Of course, there is a lot more information than just names in an EXPLAIN output.

Another way is to create a more powerful version of an arrow_typeof. For example, Databricks has a DESCRIBE QUERY and Snowflake a DESCRIBE RESULT that show the expected output metadata in format similar to

  +---------+------------+
  |col_name |data_type   |
  +---------+------------+
  |one      | int        |
  |two      | bigint     |
  +---------+------------+

This way both specific Arrow types and names could be checked with one query.

alamb commented 1 year ago

Thank you @melgenek -- this is excellent advice.

We already have a version of this (though this also looks at the data) with the DESCRIBE dataframe method... Maybe we could connect it into SQL

https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.describe