databendlabs / databend

𝗗𝗮𝘁𝗮, 𝗔𝗻𝗮𝗹𝘆𝘁𝗶𝗰𝘀 & 𝗔𝗜. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com
Other
7.87k stars 751 forks source link

Feature: load parquet/ndjson support case sensitive #16897

Open youngsofun opened 1 day ago

youngsofun commented 1 day ago

Summary

currently, when reading parquet file, the fields of file schema is modified that all field names are turned to lowercase.

Solution 1

parquet/ndjson add format option case_sensitive

cons:

  1. can not copy file ('a', 'b') into ('a', 'B')
  2. select, infer_infer_schema not show the orignal field names
    1. need to create a file_format for this purpose

Solution 2:

  1. select, infer_infer_schema: add table function option case_sensitive, true by default
  2. add copy option case_sensitive, false by default (for compatible)
    1. table fields check: allow ('a', 'B'), not allow ('a', 'A')
    2. parquet
      1. arrow_to_table_schema, to_lowercase and check for dup name
      2. impl 1 (trans to select): required fields to_lowercase
      3. impl 2 (match): table fields to_lower_case
    3. ndjson:
      1. both to_lower_case when matching names

cons:

  1. the default behavior of select and copy are not consist

pros

  1. select, infer_infer_schema show what the original name by default (maybe we can sacrifice this and let all case_sensitive=false by default for consist)
youngsofun commented 1 day ago

cc @sundy-li @everpcpc @wubx @Xuanwo

youngsofun commented 23 hours ago

choose Solution 2 after discuss with @sundy-li.

to make it more clear, I propose:

  1. add copy option COLUMN_MATCH_MODE
COLUMN_MATCH_MODE:
  CASE_SENSITIVE: Match columns by name, case-sensitive.
  CASE_INSENSITIVE: Match columns by name, case-insensitive.
  POSITION: Match columns by position instead of name.
  FORMAT_DEFAULT: Use the default matching behavior based on file format.

FILE_FORMAT:
  CSV: Default POSITION.
  Parquet/ORC/NDJson: Default CASE_INSENSITIVE.

note nota all mode for all format are supported, we will do them one by one

  1. select, infer_infer_schema add param column_name_to_lowercase=TRUE|FALSE, default false

    will remind user to use this option if there are related errors.