facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

Support both the legacy (pre-0.13, letters and numbers) and the new (post-0.13, all Unicode in backticks) specification for Hive identifiers #10785

Open yingsu00 opened 4 weeks ago

yingsu00 commented 4 weeks ago

Description

In Hive 0.12 and earlier, only alphanumeric and underscore characters are allowed in table and column names.

In Hive 0.13 and later, column names can contain any Unicode character (see HIVE-6013). Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. Backtick quotation also enables the use of reserved keywords for table and column identifiers.
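
For illustration, here is a minimal C++ sketch of this unquoting rule. unquoteHiveIdentifier is a hypothetical helper written for this issue, not Hive or Velox code:

#include <optional>
#include <string>

// Strips the surrounding backticks and collapses doubled backticks
// into single literal backticks, per the post-0.13 Hive rule.
std::optional<std::string> unquoteHiveIdentifier(const std::string& quoted) {
  if (quoted.size() < 2 || quoted.front() != '`' || quoted.back() != '`') {
    return std::nullopt; // Not a backtick-quoted identifier.
  }
  std::string name;
  size_t i = 1;
  const size_t last = quoted.size() - 1; // Position of the closing backtick.
  while (i < last) {
    if (quoted[i] == '`') {
      // A literal backtick inside the quotes must appear doubled.
      if (i + 1 >= last || quoted[i + 1] != '`') {
        return std::nullopt; // Stray single backtick: malformed.
      }
      name += '`';
      i += 2;
    } else {
      name += quoted[i]; // Any other byte (including UTF-8) is kept as-is.
      ++i;
    }
  }
  return name;
}

For example, calling it on `emp``no` yields emp`no, while a stray single backtick inside the quotes is rejected as malformed.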

To revert to pre-0.13.0 behavior and restrict column names to alphanumeric and underscore characters, set the configuration property hive.support.quoted.identifiers to none. In this configuration, backticked names are interpreted as regular expressions. For details, see Supporting Quoted Identifiers in Column Names.
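
To make the legacy rule concrete, a small sketch (again hypothetical, not Hive or Velox code) of the pre-0.13 check:

#include <cctype>
#include <string>

// Pre-0.13 rule: every character must be a letter, digit, or underscore.
bool isLegacyHiveIdentifier(const std::string& name) {
  if (name.empty()) {
    return false;
  }
  for (unsigned char c : name) {
    if (c != '_' && !std::isalnum(c)) {
      return false;
    }
  }
  return true;
}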

yingsu00 commented 4 weeks ago

cc @Yuhta @agrawalreetika

yingsu00 commented 4 weeks ago

https://github.com/facebookincubator/velox/pull/10784 is a temporary solution to an urgent internal issue. The more correct way is to comply with the Hive spec and use a config property to switch back to the legacy behavior.

Yuhta commented 4 weeks ago

My feeling is this should be handled in Presto not in Velox? In the types and column names we should receive unquoted plain literals.

yingsu00 commented 4 weeks ago

My feeling is this should be handled in Presto not in Velox? In the types and column names we should receive unquoted plain literals.

I think Velox also needs changes. The spec allows all Unicode chars, like '号码'. The current Tokenizer only supports numbers, letters, and several special chars:

bool Tokenizer::isUnquotedPathCharacter(char c) {
  return c == ':' || c == '$' || c == '-' || c == '/' || c == '@' || c == '|' ||
      c == '#' || c == '.' || isUnquotedSubscriptCharacter(c);
}

bool Tokenizer::isUnquotedSubscriptCharacter(char c) {
  return c == '-' || c == '_' || isalnum(c);
}
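
To see why this rejects a name like 号码, here is a standalone illustration; the two checks are copied from above as free functions, with an unsigned-char cast added so isalnum is safe on high-bit bytes:

#include <cctype>
#include <cstdio>

bool isUnquotedSubscriptCharacter(char c) {
  return c == '-' || c == '_' ||
      std::isalnum(static_cast<unsigned char>(c));
}

bool isUnquotedPathCharacter(char c) {
  return c == ':' || c == '$' || c == '-' || c == '/' || c == '@' ||
      c == '|' || c == '#' || c == '.' || isUnquotedSubscriptCharacter(c);
}

int main() {
  // "号码" is two code points encoded as six UTF-8 bytes
  // (E5 8F B7 E7 A0 81); each byte has the high bit set, so every
  // byte fails both checks and the identifier cannot be tokenized.
  const char* name = "号码";
  for (const char* p = name; *p != '\0'; ++p) {
    std::printf("byte 0x%02X accepted: %d\n",
        static_cast<unsigned char>(*p),
        isUnquotedPathCharacter(*p) ? 1 : 0);
  }
  return 0;
}
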
Yuhta commented 3 weeks ago

Yes, the Tokenizer is the only place you need to change. Maybe we should rename it to PrestoTokenizer, since Spark would use a different parser.

yingsu00 commented 3 weeks ago

My feeling is that the implementation should be per table format: HiveTokenizer, IcebergTokenizer, etc. The Hive spec allows all Unicode chars inside backtick quotes, while Iceberg normalizes special chars into other ASCII chars (see https://github.com/apache/iceberg/issues/10120). For example, an Iceberg column named TEST:A1B2.RAW.ABC-GG-1-A is transformed into TEST_x3AA1B2_x2ERAW_x2EABC_x2DGG_x2D1_x2DA. I wonder how Spark parses Hive column names. Does it not follow the Hive spec?
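
A rough sketch (not Iceberg code) of that normalization, assuming characters outside [A-Za-z0-9_] are rewritten as _x followed by their uppercase hex code:

#include <cctype>
#include <cstdio>
#include <string>

std::string normalizeIcebergName(const std::string& name) {
  std::string out;
  for (unsigned char c : name) {
    if (c == '_' || std::isalnum(c)) {
      out += static_cast<char>(c); // Safe characters pass through.
    } else {
      char buf[8];
      std::snprintf(buf, sizeof(buf), "_x%02X", c); // ':' -> "_x3A", etc.
      out += buf;
    }
  }
  return out;
}

int main() {
  std::printf("%s\n",
      normalizeIcebergName("TEST:A1B2.RAW.ABC-GG-1-A").c_str());
  // Prints: TEST_x3AA1B2_x2ERAW_x2EABC_x2DGG_x2D1_x2DA
  return 0;
}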

Yuhta commented 3 weeks ago

#7252 has some details about the Spark/Gluten use cases. If we can unify the parser per table format, then we can name it after the table format.

yingsu00 commented 3 weeks ago

@Yuhta Thanks for the link. I dug a bit more and found that column names like emp.no are not allowed in Hive even when backtick-quoted, according to https://issues.apache.org/jira/browse/HIVE-10120:


CREATE TABLE a (`emp.no` string);
select `emp.no` from a; -- fails

Also here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

In Hive 0.13 and later, column names can contain any Unicode character (see https://issues.apache.org/jira/browse/HIVE-6013); however, dot (.) and colon (:) yield errors on querying, so they are disallowed in Hive 1.2.0 (see https://issues.apache.org/jira/browse/HIVE-10120).

So it seems to me we should throw a user error if a column name contains a dot, and treat dot only as a delimiter. But I don't know whether this is reasonable for Spark users who use Hive tables, so I will leave it to the Gluten folks to decide. I added these comments to https://github.com/facebookincubator/velox/issues/7252 as well.

On our side, we can create a PrestoHiveTokenizer that implements everything the Hive spec requires, both pre-0.13 (letters and numbers only) and post-0.13 (all Unicode chars inside backtick quotes). Then we can add the config property hive.support-quoted-identifiers to let the user switch back and forth; a rough sketch follows.
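
For concreteness, here is what that switch could look like. PrestoHiveTokenizer is only a name from this proposal; the constructor flag and validation entry point are hypothetical, and the real config plumbing would go through the engine's config machinery:

#include <cctype>
#include <stdexcept>

class PrestoHiveTokenizer {
 public:
  // supportQuotedIdentifiers would come from the proposed
  // hive.support-quoted-identifiers config property.
  explicit PrestoHiveTokenizer(bool supportQuotedIdentifiers)
      : supportQuotedIdentifiers_(supportQuotedIdentifiers) {}

  void validateIdentifierChar(char c, bool insideBackticks) const {
    if (supportQuotedIdentifiers_ && insideBackticks) {
      return; // Post-0.13: any Unicode byte is legal inside backticks.
    }
    // Legacy (pre-0.13) rule: letters, digits, and underscore only.
    const unsigned char uc = static_cast<unsigned char>(c);
    if (uc != '_' && !std::isalnum(uc)) {
      throw std::invalid_argument("Illegal character in identifier");
    }
  }

 private:
  const bool supportQuotedIdentifiers_;
};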

Do you think this is reasonable? If yes, @agrawalreetika can continue the work on it. cc @tdcmeehan

Yuhta commented 3 weeks ago

I think it's good to beef up the current Tokenizer into a PrestoHiveTokenizer. Note that this class should not be used in any production code path inside Velox, though; it should only be used in unit tests. Velox should always receive structured Subfield objects instead of raw strings in production code paths. @rui-mo is helping clean this up in #10693. The only production usage should be in the Prestissimo code base.
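
To illustrate the distinction with simplified stand-in types (these are not the actual Velox Subfield classes): a structured path carries each component as data, so a dot inside a field name never has to be disambiguated from a delimiter.

#include <string>
#include <vector>

// Simplified stand-ins, not the real Velox types.
struct PathElement {
  std::string fieldName; // One component of a subfield path.
};
using SubfieldPath = std::vector<PathElement>;

int main() {
  // Structured form: "emp.no" is a single component; no parsing needed.
  SubfieldPath structured{PathElement{"emp.no"}};

  // Raw-string form: a tokenizer must decide whether '.' is a
  // delimiter or part of the name, which is exactly the ambiguity
  // discussed in this issue.
  std::string raw = "emp.no";

  (void)structured;
  (void)raw;
  return 0;
}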