databendlabs / databend

Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

bug: Ambiguous column error when loading different case columns via COPY INTO #16473

Open rad-pat opened 1 week ago

rad-pat commented 1 week ago

Search before asking

Version

v1.2.635-nightly

What's Wrong?

When trying to load data via COPY INTO from a Parquet file with case-sensitive columns, an ambiguous column error is reported. The error appears to originate from https://github.com/datafuselabs/databend/blob/2c41faf63b5136ab22a2ab3ac972c9752dc4a7bf/src/query/sql/src/planner/binder/bind_context.rs#L301

The Parquet file contains columns such as c1 and C1, both of which exist in the target table.

How to Reproduce?

Sorry, it is not easy for me to write a script to replicate this, but it is easy to describe:

  1. Create a Parquet file with case-sensitive columns, e.g. c1 and C1
  2. Create a table to receive these columns: create table t_sensitive ("c1" int, "C1" int);
  3. Stage the Parquet file
  4. Load the Parquet file via COPY INTO
    COPY INTO t_sensitive
    FROM 'gcs://bucket/folder/'
    CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com', CREDENTIAL = '<snip>')
    FILE_FORMAT = (TYPE = PARQUET)
    FORCE = TRUE;

Are you willing to submit PR?

dosubot[bot] commented 1 week ago

The ambiguous column error you're encountering when using the COPY INTO command with case-sensitive columns (c1 and C1) is most likely raised by the check at line 301 in bind_context.rs. That check counts how many tables in the metadata are not the source of an index; if there is more than one, it raises a SemanticError stating that the table of the internal column is ambiguous:

if tables.len() > 1 {
    return Err(ErrorCode::SemanticError(format!(
        "The table of the internal column `{}` is ambiguous",
        column_binding.internal_column.column_name()
    )));
}

This error can occur if the column names are not being correctly distinguished based on case sensitivity during the binding process [1].
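The case-folding effect can be illustrated with a small std-only sketch (toy code, not Databend's actual binder): if unquoted identifiers are lowercased during binding, both c1 and C1 collapse to the same normalized key, so a lookup by normalized name matches two column bindings, which is exactly the "more than one match" situation the check above rejects.

```rust
use std::collections::HashMap;

// Hypothetical sketch: group column names by their case-folded form, the
// way a case-insensitive resolver would. Two originals behind one key
// means any lookup by that key is ambiguous.
fn main() {
    let columns = ["c1", "C1"];
    let mut bindings: HashMap<String, Vec<&str>> = HashMap::new();
    for col in columns {
        // Unquoted identifiers are lowercased when case sensitivity is off.
        bindings.entry(col.to_lowercase()).or_default().push(col);
    }
    for (normalized, originals) in &bindings {
        if originals.len() > 1 {
            // Mirrors the `tables.len() > 1` style of check quoted above.
            println!("column `{}` is ambiguous: matches {:?}", normalized, originals);
        }
    }
}
```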

To address this issue, you may need to ensure that the normalize_identifier function correctly handles case sensitivity for both quoted and unquoted identifiers. The normalize_identifier function is used to normalize identifiers based on the case sensitivity settings defined in the NameResolutionContext:

pub fn normalize_identifier(ident: &Identifier, context: &NameResolutionContext) -> Identifier {
    if (ident.is_quoted() && context.quoted_ident_case_sensitive)
        || (!ident.is_quoted() && context.unquoted_ident_case_sensitive)
    {
        ident.clone()
    } else {
        Identifier::from_name(ident.span, ident.name.to_lowercase())
    }
}

You should verify that the NameResolutionContext is configured correctly for your use case, ensuring that quoted identifiers are treated as case-sensitive and unquoted identifiers are handled according to your requirements [2].
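As a rough illustration of how these two settings interact, here is a minimal sketch with toy Identifier and NameResolutionContext types (the real Databend types differ, and the setting values used here are assumptions, not confirmed defaults): quoted identifiers keep their case, while unquoted ones fold to lowercase, so a quoted "C1" survives but an unquoted C1 collides with c1.

```rust
// Toy types standing in for Databend's Identifier / NameResolutionContext.
#[derive(Clone, Debug)]
struct Identifier {
    name: String,
    quoted: bool,
}

struct NameResolutionContext {
    quoted_ident_case_sensitive: bool,
    unquoted_ident_case_sensitive: bool,
}

// Same rule as the function quoted above: keep the identifier as-is when
// its kind (quoted/unquoted) is configured case-sensitive, else lowercase.
fn normalize_identifier(ident: &Identifier, ctx: &NameResolutionContext) -> Identifier {
    if (ident.quoted && ctx.quoted_ident_case_sensitive)
        || (!ident.quoted && ctx.unquoted_ident_case_sensitive)
    {
        ident.clone()
    } else {
        Identifier { name: ident.name.to_lowercase(), quoted: ident.quoted }
    }
}

fn main() {
    // Assumed configuration for illustration only.
    let ctx = NameResolutionContext {
        quoted_ident_case_sensitive: true,
        unquoted_ident_case_sensitive: false,
    };
    let quoted = Identifier { name: "C1".into(), quoted: true };
    let unquoted = Identifier { name: "C1".into(), quoted: false };
    println!("{}", normalize_identifier(&quoted, &ctx).name); // prints C1
    println!("{}", normalize_identifier(&unquoted, &ctx).name); // prints c1
}
```

With this configuration, writing the column as "C1" (quoted) in SQL would resolve it case-sensitively, while an unquoted C1 would fold to c1 and clash with the other column.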

If the issue persists, you might need to review the logic in the COPY INTO command's processing of Parquet files to ensure that it correctly handles case-sensitive column names. The ParquetCopySource struct and its methods (event, process, and async_process) are responsible for processing Parquet files during the COPY INTO operation [3][4][5].

By ensuring that the case sensitivity settings and the identifier normalization logic are correctly implemented, you should be able to resolve the ambiguous column error.

To continue talking to Dosu, mention @dosu.