duckdb / duckdb

DuckDB is an in-process SQL OLAP Database Management System
http://www.duckdb.org
MIT License
17.73k stars 1.51k forks source link

sniff_csv: the Columns value cannot be parsed using DuckDB if header names contain single quote #12089

Closed no23reason closed 2 weeks ago

no23reason commented 2 weeks ago

What happens?

Problem When the sniff_csv function is used, the Columns that it returns cannot be parsed when any of the headers in the CSV contain a single quote character in them. If the column header contains double single quote, DuckDB interprets it as an SQL escape and outputs only a single quote when output.

Proposed solution The Columns output of the sniff_csv should escape the single quotes in the column names so that the value can be used safely from subsequent SQL commands.

To Reproduce

Run this script

import tempfile

import duckdb

def test(col_name: str):
    with tempfile.NamedTemporaryFile(suffix=".csv", mode="w") as file:
        file.write(f"Name,{col_name}\n")
        file.write("Alice,123\n")
        file.write("Bob,456\n")
        file.flush()

        with duckdb.connect(":memory:", config={"threads": 1}) as conn:
            (cols,) = conn.execute(
                "SELECT Columns FROM sniff_csv(?)",
                [file.name],
            ).fetchone()

            col_query = conn.execute(f"SELECT {cols}")
            column_names = col_query.fetchone()[0].keys()
            print(column_names)

test("Age")  # ['Name', 'Age']
test("Escaped''Quote")  # ['Name', "Escaped'Quote"]
test("Unescaped'Quote")  # duckdb.duckdb.ParserException: Parser Error: syntax error at or near "Quote"

If the column name has no single quotes it works fine. If the single quotes are doubled (i.e. SQL-escaped) it works but the result shows only one qute instead of the original two. Otherwise it raises an exception.

OS:

macOS aarch64

DuckDB Version:

0.10.2

DuckDB Client:

Python

Full Name:

Dan Homola

Affiliation:

GoodData

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?