youngsofun commented 1 day ago

Summary

currently， when reading parquet file, the fields of file schema is modified that all field names are turned to lowercase.

Solution 1

parquet/ndjson add format option case_sensitive

cons:

can not copy file ('a', 'b') into ('a', 'B')
select, infer_infer_schema not show the orignal field names
1. need to create a file_format for this purpose

Solution 2：

select, infer_infer_schema: add table function option case_sensitive, true by default
add copy option case_sensitive, false by default （for compatible）
1. table fields check: allow ('a', 'B'), not allow ('a', 'A')
2. parquet
  1. arrow_to_table_schema, to_lowercase and check for dup name
  2. impl 1 (trans to select): required fields to_lowercase
  3. impl 2 (match): table fields to_lower_case
3. ndjson:
  1. both to_lower_case when matching names

cons：

the default behavior of select and copy are not consist

pros

select, infer_infer_schema show what the original name by default (maybe we can sacrifice this and let all case_sensitive=false by default for consist)

youngsofun commented 1 day ago

cc @sundy-li @everpcpc @wubx @Xuanwo

youngsofun commented 23 hours ago

choose Solution 2 after discuss with @sundy-li.

to make it more clear, I propose:

add copy option COLUMN_MATCH_MODE

COLUMN_MATCH_MODE:
  CASE_SENSITIVE: Match columns by name, case-sensitive.
  CASE_INSENSITIVE: Match columns by name, case-insensitive.
  POSITION: Match columns by position instead of name.
  FORMAT_DEFAULT: Use the default matching behavior based on file format.

FILE_FORMAT:
  CSV: Default POSITION.
  Parquet/ORC/NDJson: Default CASE_INSENSITIVE.

note nota all mode for all format are supported, we will do them one by one

select, infer_infer_schema add param column_name_to_lowercase=TRUE|FALSE, default false

will remind user to use this option if there are related errors.

databendlabs / databend

Feature: load parquet/ndjson support case sensitive #16897

Solution 1

Solution 2：