apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0
635 stars 149 forks source link

Schema: Allow field name `foo.bar` even if struct foo->bar is present #591

Open c-thiel opened 2 months ago

c-thiel commented 2 months ago

Currently due to the way name-to-id is build, we cannot have points in columnames if it collides with a struct.

The following schema fails to build:

        let schema = Schema::builder()
            .with_schema_id(1)
            .with_fields(vec![
                NestedField::required(1, "foo", Primitive(PrimitiveType::String)).into(),
                NestedField::required(2, "foo.bar", Primitive(PrimitiveType::Int)).into(),
                NestedField::required(
                    3,
                    "foo",
                    Type::Struct(StructType::new(vec![NestedField::required(
                        4,
                        "bar",
                        Primitive(PrimitiveType::Int),
                    )
                    .into()])),
                )
                .into(),
            ])
            .build()
            .unwrap();

By prohibiting this we follow the Java implementation. There is nothing in the iceberg spec that prohibits these names, so I think we should allow them. For column names as a user I expect the same behaviors as for namespaces or databases with points - it needs to be escaped. As escaping depends on the query engine, and iceberg-rust as well as iceberg java has no SQL-parser, those libraries should not take away the option from the engine or make that decision in their stead.

I propose to change the representation of a colname to Vec<String> instead of just "String with points". It would also make accessor compatible between schemas - even if we decide to stick keep this artificial restriction.

c-thiel commented 2 months ago

CC @Fokko , @nastra , @Xuanwo

liurenjie1024 commented 2 months ago

Allowing such a case would also require a change on the table scan api, e.g. we need a way to allow user to tell use what foo.bar actually means.

liurenjie1024 commented 2 months ago

Also cc @rdblue

yukkit commented 1 month ago

FWIW. Can we introduce a structure similar to TableIdent to represent field paths, named 'FieldPath'?

  1. We could modify the elements of selected TableScan::column_names from string to FieldPath (and provide some utilities to simplify the construction of FieldPath).
  2. Change the key in name_to_id to FieldPath.
  3. Perhaps the logic for looking up fields via FieldPath would also need to be modified.
rdblue commented 1 month ago

This is an implementation detail that is not part of the spec. But I don't think it is worth bothering to enable both ["foo.bar"] and ["foo", "bar"] identifiers in the same schema. That's confusing for users, who will almost certainly be confused by tables with such odd structures.

The reference implementation has been this way for about 7 years and no one has every complained or, to my knowledge, hit a problem with this in practice. I highly recommend focusing time and effort on other improvements.