Open yingsu00 opened 4 weeks ago
cc @Yuhta @agrawalreetika
https://github.com/facebookincubator/velox/pull/10784 is a temp solution to solve a urgent internal issue. The more correct way is to comply the Hive spec and use a config property to switch back to legacy usage.
My feeling is this should be handled in Presto not in Velox? In the types and column names we should receive unquoted plain literals.
My feeling is this should be handled in Presto not in Velox? In the types and column names we should receive unquoted plain literals.
I think Velox also needs change. The spec says all Unicode chars, like '号码‘. The current Tokenizer can only support numbers, letters and several special chars:
bool Tokenizer::isUnquotedPathCharacter(char c) {
return c == ':' || c == '$' || c == '-' || c == '/' || c == '@' || c == '|' ||
c == '#' || c == '.' || isUnquotedSubscriptCharacter(c);
}
bool Tokenizer::isUnquotedSubscriptCharacter(char c) {
return c == '-' || c == '_' || isalnum(c);
}
Yes Tokenizer
is the only place you need to change. Maybe we should rename it to PrestoTokenizer
since Spark would use a different parser.
My feeling is that the implementation should be for specific table formats, like HiveTokenizer, IcebergTokenizer, etc. The Hive spec allows all Unicode chars in quotation marks, while Iceberg normalize special chars into some other ascii chars (see https://github.com/apache/iceberg/issues/10120) . For example, an Iceberg table with TEST:A1B2.RAW.ABC-GG-1-A
column is transformed into TEST_x3AA1B2_x2ERAW_x2EABC_x2DGG_x2D1_x2DA
.
I wonder how Spark parses HIve column names. Does it not follow the Hive spec?
@Yuhta Thanks for the link. I digged a bit more, and found column names like emp.no is not allowed in Hive, even if it's backtick quoted, according to https://issues.apache.org/jira/browse/HIVE-10120
CREATE TABLE a (`emp.no` string);
select `emp.no` from a; --fails with this message:
Also here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
In Hive 0.13 and later, column names can contain any [Unicode](http://en.wikipedia.org/wiki/List_of_Unicode_characters) character (see [HIVE-6013](https://issues.apache.org/jira/browse/HIVE-6013)), however, dot (.) and colon (:) yield errors on querying, so they are disallowed in Hive 1.2.0 (see [HIVE-10120](https://issues.apache.org/jira/browse/HIVE-10120)).
So it seems to me we should throw a user error if the column names contains dot, and dot should be treated as delimiter only. But I don't know if this is reasonable for Spark users who're using Hive tables, so will leave it to the Gluten folks to decide. I added the comments to https://github.com/facebookincubator/velox/issues/7252 as well.
On our side, we can create a PrestoHiveTokenizer that implements all required things in Hive spec, which is either pre0.13(only numbers or letters) or after 0.13(all unicode chars under backtick quotations). Then we can add the config property hive.support-quoted-identifiers
to allow the user to switch back and force.
Do you think this is reasonable? If yes @agrawalreetika can continue the work on it. cc @tdcmeehan
I think it's good we beef up the current Tokenizer
into PrestoHiveTokenizer
. Note that this class should not be used in any production code path inside Velox though, it should only be used in unit tests. Velox should always receive structured Subfield
instead of raw strings in production code path. @rui-mo is helping on cleaning this up in #10693. The only usage in production code should be in Prestissimo code base.
Description
In Hive 0.12 and earlier, only alphanumeric and underscore characters are allowed in table and column names.
In Hive 0.13 and later, column names can contain any Unicode character (see HIVE-6013). Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. Backtick quotation also enables the use of reserved keywords for table and column identifiers.
To revert to pre-0.13.0 behavior and restrict column names to alphanumeric and underscore characters, set the configuration property hive.support.quoted.identifiers to none. In this configuration, backticked names are interpreted as regular expressions. For details, see Supporting Quoted Identifiers in Column Names.