facebookincubator / velox

A C++ vectorized database acceleration library aimed to optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0
3.29k stars 1.09k forks source link

Type name case sensitivity #10085

Open chliang71 opened 1 month ago

chliang71 commented 1 month ago

Description

Currently when checking type children here and here, it is using string equal which is case sensitive.

There is also a [test|https://github.com/facebookincubator/velox/blob/main/velox/type/tests/TypeTest.cpp#L405] which has a todo comment regarding case behavior.

Internally we are running into cases like Field not found: foobar. Available fields are: x, y, fooBar

So I wonder is there any particular reason/use case to be case sensitive? Or should we make it case insensitive? If yes, I can file a PR.

Yuhta commented 1 month ago

I would say the case sensitive matching is a reasonable default. Making it configurable without performance hit is also complicated. After all you can maintain a mapping from the user-facing names to normalized names in planner, and worker (thus Velox) only need to deal with normalized names thus case sensitivity should not be a problem here.

yingsu00 commented 1 month ago

My understanding that whether a column name is case sensitive depends on the database implementation and there is no rule to say if it should be case insensitive. https://www.scaler.com/topics/is-sql-case-sensitive/. So if Velox chooses to allow it to be case sensitive then it is what it is. @Yuhta Is this a common consensus in Meta?

Yuhta commented 4 weeks ago

@yingsu00 It's more about design the library to support more cases. Velox is at the bottom level so it has to be in the most strict mode, and engine authors can relax the criteria by normalizing the names. If Velox chose to match name case-insensitively, there would be nothing engine authors can do to make it case sensitive at higher level.

yingsu00 commented 3 weeks ago

I consulted more people on this matter, and as @tdcmeehan pointed out, "Case sensitivity is a property of storage. For example, Hive and Iceberg are case insensitive, but some SQL databases are case sensitive. It is not a property of the engine--the engine needs to respect the convention adopted by storage. " If this is correct, implementing sense insensitiviy in just one single Engine probably won't work. "Suppose Spark wrote fooBar, and Presto wrote FooBar. In the Presto planner, we say, OK, the normalized name is foobar. We plan out the query and one worker is reading fooBar, another is reading FooBar--how can we accommodate case insensitivity without knowing in advance that we've written out the column in two different formats"

It seems to me that we may want to normalize the column names to lower case in the Hive connector which supports both Hive table format and Iceberg table format. One possible place for it to happen is in SplitReader::adaptColumns() (and IcebergSplitReader::adaptColumns() if needed for any semantic differences between these two table formats). The general type system in the core Velox and Presto planner, however, need to remain case sensitive. Does it make sense?

Yuhta commented 3 weeks ago

No by storage he means metastore, not Velox. Files do not keep public facing names, that's the job of metastore. The names in files are just data stream identifiers. Engines (or the storage submodule in engines) need to match user facing names to these identifiers (with the help of metastore), and different engines can have different ways to match them (by name, case sensitive/insensitive, by position, by other unique id than name). Different engines using the same storage do need to keep consistent though, otherwise there will be confusion.

Yuhta commented 3 weeks ago

We do have an option to normalize all names in file to lower case to make matching by name case insensitive easier for now, but we do have plan to have some configurable mechanism to make the column matching more generic. The implementation would likely to be replace the names in file with some changed names, like what DWRF reader is doing when hive.orc.use-column-names=false), but the interface is to be designed.

yingsu00 commented 4 days ago

@Yuhta We met similar issue when the column name contains spaces. I created https://github.com/facebookincubator/velox/issues/10388. I think the say to solve it is to 1) send the HiveColumnIndex from Presto to Velox in HiveColumnHandle 2) Support this use-column-names=false in Velox 3) Add space as a valid char in Tokenizer.

Our team is already working on 1)

Yuhta commented 4 days ago

@yingsu00 I replied in #10388 that we should be able to solve it without (1) because we use HiveTableHandle::dataColumns to get full table schema.