apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Row filter parse exception on column starting with underscore #1357

Closed: vincenzon closed this issue 1 hour ago

vincenzon commented 20 hours ago

Apache Iceberg version

0.8.0 (latest release)

Please describe the bug 🐞

A row_filter passed to a table overwrite throws a parse exception if the column name begins with an underscore. The example below demonstrates the issue. I tried quoting the column name, but that didn't help.

import pathlib

import pyarrow as pa
import pandas as pd

from pyiceberg.schema import Schema
from pyiceberg.types import StringType, NestedField
from pyiceberg.exceptions import NamespaceAlreadyExistsError
from pyiceberg.catalog.sql import SqlCatalog

catalog_path = "/tmp/iceberg_data"

pathlib.Path(catalog_path).mkdir(exist_ok=True)

catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{catalog_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{catalog_path}",
    },
)

try:
    catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
    pass

df = pa.Table.from_pandas(pd.DataFrame(
    {
        "_X": ['a', 'b', 'c'],
        "YY": ['A','A','A'],
    }))

schema = Schema(
    NestedField(field_id=1, name='_X', field_type=StringType(), required=False), 
    NestedField(field_id=2, name='YY', field_type=StringType(), required=False), 
    schema_id=0,
    identifier_field_ids=[])

table = "default.foo"
try:
    catalog.drop_table(table)
except Exception:
    # Table may not exist yet; ignore.
    pass

tbl = catalog.create_table(
    identifier=table,
    schema=schema,
)

tbl.append(df)

# These two examples raise ParseException:
tbl.overwrite(df, """  _X  == 'a' """)
#tbl.overwrite(df, """ "_X" == 'a' """)

# These two examples are fine:
#tbl.overwrite(df, """  YY  == 'A' """)
#tbl.overwrite(df, """ "YY" == 'A' """)

vincenzon commented 19 hours ago

It looks like this is the problem:

https://github.com/apache/iceberg-python/blob/7a83695330518bea0dee589b5b513297c4d59b66/pyiceberg/expressions/parser.py#L82

I think it should be:

unquoted_identifier = Word(alphas + "_", alphanums + "_$")
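In pyparsing, the first argument to `Word(init_chars, body_chars)` is the set of characters allowed in the first position, so the current grammar (`Word(alphas, alphanums + "_$")`) rejects any identifier that starts with an underscore even though `_` is a legal body character. A regex analogue of the two definitions (an illustration, not pyiceberg's actual code) shows the difference:

```python
import re

# Regex equivalents of the two pyparsing Word definitions.
# Current grammar: first char must be a letter.
current = re.compile(r"[A-Za-z][A-Za-z0-9_$]*\Z")
# Proposed fix: first char may also be an underscore.
fixed = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*\Z")

assert current.match("YY")            # plain name parses today
assert current.match("_X") is None    # leading underscore is rejected
assert fixed.match("_X")              # accepted with the proposed fix
assert fixed.match("YY")              # existing names still parse
```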

kevinjqliu commented 18 hours ago

thanks for reporting this. do you know if a leading underscore in a column name is valid in spark sql?

vincenzon commented 17 hours ago

According to https://spark.apache.org/docs/latest/sql-ref-identifier.html, it is allowed. In fact, the way quoting is handled by pyiceberg is wrong on two levels.

Fixing the quote character is easy; fixing the second issue would be more involved.