apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.49k stars 1.02k forks source link

octet_length(char) doesn't behave as expected. #6508

Open osawyerr opened 1 year ago

osawyerr commented 1 year ago

Describe the bug

Hi there,

select octet_length(char) doesn't behave as expected.

In the example below - in Postgres (text column) the below SQL would return 25 as the result however in DataFusion (StringArray column ) it just returns the original length of the values in n_name.

select octet_length(n_name::char(25)) from nation;

I'm not sure if this is because the cast to a char(25) is not behaving as expected

To Reproduce

create a StringColumn in DF, populate it with some values and run the following SQL on it.

select octet_length(n_name::char(25)) from nation;

Expected behavior

Should return 25, but it returns the length of each of the values in the column.

Additional context

No response

alamb commented 1 year ago

Thanks for the report @osawyerr

TLDR is that DataFusion treats VARCHAR and CHAR the same (because they use the same underlying Arrow type)

🤔 I am not quite sure what to do here. Postgres has the notion of a max width character column CHAR but arrow does not . DataFusion maps the SQL type CHAR --> arrow type Utf8

Thus I am no sure we were be able to replicate the behavior of postgres in this instance.

For anyone following along, here is what postgres does

postgres=# create table example(name varchar(20));
CREATE TABLE
postgres=# insert into example values ('foo'), ('barrr');
INSERT 0 2
postgres=# select octet_length(name) from example;
 octet_length
--------------
            3
            5
(2 rows)
postgres=# select octet_length(name::char(25)) from example;
 octet_length
--------------
           25
           25
(2 rows)
postgres=# select octet_length(name::varchar(25)) from example;
 octet_length
--------------
            3
            5
(2 rows)

postgres=#
osawyerr commented 1 year ago

@alamb There is also a related point RE how chars are handled in joins and equality in postgres vs. varcharas well. In postgres if I recall, the trailing spaces are ignored when comparing chars but are taken into account when comparing varchars.

From your explanation, DataFusion will always take into account the trailing spaces if a "char" is used in a join / equality. i.e. it will behave like a postgres varchar.

i.e.

-- note the trailing spaces below
select * from example where name::char(25) = 'foo    '::char(25) 

Will return no results in DataFusion but in postgres the result will be returned.