databendlabs / databend

Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

bug: Ambiguous column error when loading different case columns via COPY INTO #16473

Open rad-pat opened 1 week ago

rad-pat commented 1 week ago

Search before asking

Version

v1.2.635-nightly

What's Wrong?

When trying to load data via COPY INTO from a Parquet file with case-sensitive columns, an ambiguous column error is reported. The error appears to originate from https://github.com/datafuselabs/databend/blob/2c41faf63b5136ab22a2ab3ac972c9752dc4a7bf/src/query/sql/src/planner/binder/bind_context.rs#L301

The Parquet file contains columns such as c1 and C1, both of which exist in the target table.

How to Reproduce?

Sorry, it is not easy for me to write a script to replicate this, but it is easy to describe:

  1. Create a Parquet file with case-sensitive columns, e.g. c1 and C1
  2. Create a table to receive these columns: create table t_sensitive ("c1" int, "C1" int);
  3. Stage the Parquet file
  4. Load the Parquet file via COPY INTO
    COPY INTO t_sensitive
    FROM 'gcs://bucket/folder/'
    CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com', CREDENTIAL = '<snip>')
    FILE_FORMAT = (TYPE = PARQUET)
    FORCE = TRUE;

Are you willing to submit PR?

dosubot[bot] commented 1 week ago

The ambiguous column error you're encountering when using the COPY INTO command with case-sensitive columns (c1 and C1) is most likely raised by the check at line 301 in bind_context.rs. That check counts how many tables in the metadata are not the source of an index; if there is more than one, it raises a SemanticError stating that the table of the internal column is ambiguous:

if tables.len() > 1 {
    return Err(ErrorCode::SemanticError(format!(
        "The table of the internal column `{}` is ambiguous",
        column_binding.internal_column.column_name()
    )));
}

This error can occur if the column names are not being correctly distinguished based on case sensitivity during the binding process [1].
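The case-folding effect can be illustrated with a small std-only sketch (toy code, not Databend's actual binder): if unquoted identifiers are lowercased during binding, both c1 and C1 collapse to the same normalized key, so a lookup by normalized name matches two column bindings, which is exactly the "more than one match" situation the check above rejects.

```rust
use std::collections::HashMap;

// Hypothetical sketch: group column names by their case-folded form, the
// way a case-insensitive resolver would. Two originals behind one key
// means any lookup by that key is ambiguous.
fn main() {
    let columns = ["c1", "C1"];
    let mut bindings: HashMap<String, Vec<&str>> = HashMap::new();
    for col in columns {
        // Unquoted identifiers are lowercased when case sensitivity is off.
        bindings.entry(col.to_lowercase()).or_default().push(col);
    }
    for (normalized, originals) in &bindings {
        if originals.len() > 1 {
            // Mirrors the `tables.len() > 1` style of check quoted above.
            println!("column `{}` is ambiguous: matches {:?}", normalized, originals);
        }
    }
}
```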

To address this issue, you may need to ensure that the normalize_identifier function correctly handles case sensitivity for both quoted and unquoted identifiers. The normalize_identifier function is used to normalize identifiers based on the case sensitivity settings defined in the NameResolutionContext:

pub fn normalize_identifier(ident: &Identifier, context: &NameResolutionContext) -> Identifier {
    if (ident.is_quoted() && context.quoted_ident_case_sensitive)
        || (!ident.is_quoted() && context.unquoted_ident_case_sensitive)
    {
        ident.clone()
    } else {
        Identifier::from_name(ident.span, ident.name.to_lowercase())
    }
}

You should verify that the NameResolutionContext is configured correctly for your use case, ensuring that quoted identifiers are treated as case-sensitive and unquoted identifiers are handled according to your requirements [2].
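As a rough illustration of how these two settings interact, here is a minimal sketch with toy Identifier and NameResolutionContext types (the real Databend types differ, and the setting values used here are assumptions, not confirmed defaults): quoted identifiers keep their case, while unquoted ones fold to lowercase, so a quoted "C1" survives but an unquoted C1 collides with c1.

```rust
// Toy types standing in for Databend's Identifier / NameResolutionContext.
#[derive(Clone, Debug)]
struct Identifier {
    name: String,
    quoted: bool,
}

struct NameResolutionContext {
    quoted_ident_case_sensitive: bool,
    unquoted_ident_case_sensitive: bool,
}

// Same rule as the function quoted above: keep the identifier as-is when
// its kind (quoted/unquoted) is configured case-sensitive, else lowercase.
fn normalize_identifier(ident: &Identifier, ctx: &NameResolutionContext) -> Identifier {
    if (ident.quoted && ctx.quoted_ident_case_sensitive)
        || (!ident.quoted && ctx.unquoted_ident_case_sensitive)
    {
        ident.clone()
    } else {
        Identifier { name: ident.name.to_lowercase(), quoted: ident.quoted }
    }
}

fn main() {
    // Assumed configuration for illustration only.
    let ctx = NameResolutionContext {
        quoted_ident_case_sensitive: true,
        unquoted_ident_case_sensitive: false,
    };
    let quoted = Identifier { name: "C1".into(), quoted: true };
    let unquoted = Identifier { name: "C1".into(), quoted: false };
    println!("{}", normalize_identifier(&quoted, &ctx).name); // prints C1
    println!("{}", normalize_identifier(&unquoted, &ctx).name); // prints c1
}
```

With this configuration, writing the column as "C1" (quoted) in SQL would resolve it case-sensitively, while an unquoted C1 would fold to c1 and clash with the other column.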

If the issue persists, you might need to review the logic in the COPY INTO command's processing of Parquet files to ensure that it correctly handles case-sensitive column names. The ParquetCopySource struct and its methods (event, process, and async_process) are responsible for processing Parquet files during the COPY INTO operation [3][4][5].

By ensuring that the case sensitivity settings and the identifier normalization logic are correctly implemented, you should be able to resolve the ambiguous column error.

To continue talking to Dosu, mention @dosu.