apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Row filter parse exception on column starting with underscore #1357

Closed: vincenzon closed this issue 1 hour ago

vincenzon commented 20 hours ago

Apache Iceberg version

0.8.0 (latest release)

Please describe the bug 🐞

A row_filter passed to a table overwrite throws a parse exception if the column name begins with an underscore. The example below demonstrates the issue. I tried quoting the column name, but that didn't help.

import pathlib

import pyarrow as pa
import pandas as pd

from pyiceberg.schema import Schema
from pyiceberg.types import StringType, NestedField
from pyiceberg.exceptions import NamespaceAlreadyExistsError
from pyiceberg.catalog.sql import SqlCatalog

catalog_path = "/tmp/iceberg_data"

pathlib.Path(catalog_path).mkdir(exist_ok=True)

catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{catalog_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{catalog_path}",
    },
)

try:
    catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
    pass

df = pa.Table.from_pandas(pd.DataFrame(
    {
        "_X": ['a', 'b', 'c'],
        "YY": ['A','A','A'],
    }))

schema = Schema(
    NestedField(field_id=1, name='_X', field_type=StringType(), required=False), 
    NestedField(field_id=2, name='YY', field_type=StringType(), required=False), 
    schema_id=0,
    identifier_field_ids=[])

table = "default.foo"
try:
    catalog.drop_table(table)
except Exception:
    # Table may not exist yet; ignore.
    pass

tbl = catalog.create_table(
    identifier=table,
    schema=schema,
)

tbl.append(df)

# These two examples raise ParseException:
tbl.overwrite(df, """  _X  == 'a' """)
#tbl.overwrite(df, """ "_X" == 'a' """)

# These two examples are fine:
#tbl.overwrite(df, """  YY  == 'A' """)
#tbl.overwrite(df, """ "YY" == 'A' """)

vincenzon commented 19 hours ago

It looks like this is the problem:

https://github.com/apache/iceberg-python/blob/7a83695330518bea0dee589b5b513297c4d59b66/pyiceberg/expressions/parser.py#L82

I think it should be:

unquoted_identifier = Word(alphas + "_", alphanums + "_$")
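In pyparsing, the first argument to `Word(init_chars, body_chars)` is the set of characters allowed in the first position, so the current grammar (`Word(alphas, alphanums + "_$")`) rejects any identifier that starts with an underscore even though `_` is a legal body character. A regex analogue of the two definitions (an illustration, not pyiceberg's actual code) shows the difference:

```python
import re

# Regex equivalents of the two pyparsing Word definitions.
# Current grammar: first char must be a letter.
current = re.compile(r"[A-Za-z][A-Za-z0-9_$]*\Z")
# Proposed fix: first char may also be an underscore.
fixed = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*\Z")

assert current.match("YY")            # plain name parses today
assert current.match("_X") is None    # leading underscore is rejected
assert fixed.match("_X")              # accepted with the proposed fix
assert fixed.match("YY")              # existing names still parse
```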

kevinjqliu commented 18 hours ago

thanks for reporting this. do you know if a leading underscore in a column name is valid in spark sql?

vincenzon commented 17 hours ago

According to https://spark.apache.org/docs/latest/sql-ref-identifier.html, it is allowed. In fact, the way quoting is handled by pyiceberg is wrong on two levels.

Fixing the quote character is easy; fixing the second issue would be more involved.