apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.23k stars 1.18k forks source link

Invalid parsing results produced by OwnedTableReference::from(&str) #5793

Closed jdye64 closed 1 year ago

jdye64 commented 1 year ago

Describe the bug

Invoking OwnedTableReference::from("schema.1table") produces an invalid TableReference if the "table name" starts with a number. In this example the result would be

Bare { table: "schema.1table" }

when in fact the correct result should be

Partial { schema: "schema", table: "1table" }

To Reproduce

A simple test can be added to datafusion/common/src/utils.rs to reproduce the bug.

#[test]
    fn test_split_names() -> Result<()> {
        let result = parse_identifiers_normalized("schema.1table");
        assert_eq!(result.len(), 2);
        Ok(())
    }

Expected behavior

Invoking OwnedTableReference::from("schema.1table") should produce the result

Partial { schema: "schema", table: "1table" }

Additional context

I have traced the issue to sqlparser-rs returning a Token::Word rather than expected Token::Period at https://github.com/apache/arrow-datafusion/blob/main/datafusion/common/src/utils.rs#L200

My main question is should be be handled in DataFusion or sqlparser-rs?

andygrove commented 1 year ago

This behavior may be dialect-specific. ANSI SQL states that table names must start with a Latin alphabetic character, for example.

andygrove commented 1 year ago

@jdye64 It looks like the hive dialect in sqlparser-rs supports table names starting with digits:

impl Dialect for HiveDialect {

    fn is_identifier_start(&self, ch: char) -> bool {
        ch.is_ascii_lowercase() || ch.is_ascii_uppercase() || ch.is_ascii_digit() || ch == '$'
    }
jdye64 commented 1 year ago

@andygrove fair point. I guess the reason I didn't consider that was because this was an existing test that had previously passed but starting failing when we upgraded to DataFusion 21.0.0. However, your point is valid and it seems like that test should have never been passing in the first place.

alamb commented 1 year ago

Invoking OwnedTableReference::from("schema.1table") produces an invalid TableReference if the "table name" starts with a number. In this example the result would be

I think this change in behavior came from https://github.com/apache/arrow-datafusion/pull/5343

Basically, the code that used to parse TableReference::from was some custom code in DataFusion that was not consistent with SQL parser.

If your goal is to replicate the old behavior you avoid calling TableReference::from and instead use the old code (see https://github.com/apache/arrow-datafusion/pull/5343/files#r1130008949) and then call TableReferencePartial::partial

https://docs.rs/datafusion/21.0.0/datafusion/catalog/enum.TableReference.html#method.partial

cc @Jefffrey

Jefffrey commented 1 year ago

Yes this is expected behaviour, see #5426 for similar issue