apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0

View Spec implementation #331

Open c-thiel opened 2 months ago

c-thiel commented 2 months ago

Implementation of the Iceberg V1 ViewMetadata

ZENOTME commented 2 months ago

Thanks! Nice work! @c-thiel

c-thiel commented 2 months ago

@ZENOTME one feature I did not implement yet is respecting "version.history.num-entries" as mentioned in the View Spec. I noticed that the table implementation also doesn't implement it. I don't think it matters much unless one is implementing a catalog, but I wanted to mention it.
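For illustration, the behavior could look like the following sketch - plain Python, not the iceberg-rust API; the function name and the assumed default of 10 retained entries are my own:

```python
# Hypothetical sketch of respecting "version.history.num-entries" when a
# new view version is committed: keep only the most recent N entries of
# the version log. Names and the default here are assumptions, not real API.

ASSUMED_DEFAULT_NUM_ENTRIES = 10

def trim_version_log(version_log, view_properties):
    """Return the version log trimmed to its most recent entries."""
    num = int(view_properties.get("version.history.num-entries",
                                  ASSUMED_DEFAULT_NUM_ENTRIES))
    if num < 1:
        raise ValueError("version.history.num-entries must be >= 1")
    return version_log[-num:]

log = [{"version-id": i, "timestamp-ms": 1_700_000_000_000 + i} for i in range(15)]
trimmed = trim_version_log(log, {"version.history.num-entries": "10"})
assert len(trimmed) == 10 and trimmed[0]["version-id"] == 5
```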

Fokko commented 2 months ago

Thanks for working on this @c-thiel. Out of curiosity, and probably this is a broader discussion, but currently, the view spec is around SQL representation. This might not be the most natural fit with dataframe-based systems. Do you have any thoughts on how to represent views in iceberg-rust?

c-thiel commented 2 months ago

@Fokko that is a hard topic.

The idealist in me would like to eventually see something like Substrait being used. Adoption of the project across engines is very slow, though.

Maybe the more practical short-term solution would be to stick to SQL representations and even limit them to one dialect - that of the "writer", including its technology. Each client could then decide whether there is a transpiler good enough for that dialect. Funnily enough, exactly this ambiguity - that there could be different dialects among the representations (say Spark and Trino) - gave me a lot of headaches. Which should I use if I am DataFusion? From which should I transpile? What do I do if they are not identical?
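To make that ambiguity concrete, a reader's choice could be sketched like this (illustrative Python; the function, dialect names, and fallback order are my assumptions, not part of the spec):

```python
# Illustrative only: one way a reader such as DataFusion could pick among
# multiple SQL representations - exact match on its own dialect first, then
# fall back to dialects it trusts a transpiler for.

def pick_representation(representations, own_dialect, transpilable=()):
    by_dialect = {r["dialect"]: r for r in representations if r["type"] == "sql"}
    if own_dialect in by_dialect:
        return by_dialect[own_dialect]          # use the native dialect as-is
    for dialect in transpilable:
        if dialect in by_dialect:
            return by_dialect[dialect]          # would require transpilation
    raise LookupError("no usable SQL representation")

reps = [
    {"type": "sql", "dialect": "spark", "sql": "select * from t"},
    {"type": "sql", "dialect": "trino", "sql": "select * from t"},
]
assert pick_representation(reps, "datafusion", ("trino", "spark"))["dialect"] == "trino"
```

The open question in the thread is exactly what to do when `transpilable` contains several matching dialects, or when their SQL texts disagree.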

Sticking to multiple representations is, in my opinion, only an option if we let the catalog handle this complexity. Unless Substrait gets a significant boost, I would assume that the following approach presents the most stable API (assuming there is only the REST catalog left ;) ):

This way a REST catalog could even switch to Substrait internally in the future, starting now by storing the presented SQL "as-is" without being able to serve any other dialect.

With SQLMesh coming up and Substrait continuing to evolve, there is at least some development in the area.

c-thiel commented 2 months ago

@Fokko from my side this is good to merge - types are complete and tests are passing.

c-thiel commented 1 week ago

@Fokko @ZENOTME are there any open points from your side that prevent us from merging the View Spec? If so, please let me know :)

Fokko commented 1 week ago

The idealist in me would like to eventually see something like Substrait being used.

There is an open discussion on evolving the spec into something like that. I think transpiling SQL will never fully work in the long run because there are so many variations and extensions to SQL dialects (custom UDFs and such); maybe you can cover 90% of the SQL, which may be good enough.

@nastra would you have time to go over this PR?

c-thiel commented 23 hours ago

@nastra, @Fokko during testing we found a problem with "default-namespace", and I am hoping for some insights from your side. According to the Iceberg view spec, "default-namespace" is required: https://iceberg.apache.org/view-spec/#versions. As such it is modeled as NamespaceIdent: https://github.com/c-thiel/iceberg-rust/blob/d8a7aabce73888e5b22712d60ca95b856db66fc2/crates/iceberg/src/spec/view_version.rs#L56

Valid Namespace Identifiers have at least one element, which is validated by the NamespaceIdent constructor.

When creating views, Spark creates metadata objects that specify the "default-namespace" field, but as an empty vec. This is quite an unlucky situation, and I am not sure what the desired behavior is:

  1. My feeling is that this is a bug in the Iceberg spec. [] as "default-namespace" provides the same information as a non-required field (null), since a namespace without any element is unusable. However, views in Spark work, which shows that a default-namespace is not always required; thus it should be optional.
  2. It's a bug in the Spark implementation, and in fact "default-namespace" should always contain at least one element.
  3. They are both right, in which case we would accept [] as a valid namespace and must change the Rust implementation.
  4. [] is in general not a valid namespace, unless it is used as a "default-namespace", in which case some special behavior is expected. This, however, brings me back to 1 - why not just use null in this case?

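To make the distinction behind these options concrete, here is a small sketch (Python for illustration, not the Rust code; parse_default_namespace is a made-up helper) of how absent, empty, and non-empty "default-namespace" values differ on the wire:

```python
# Sketch of the three wire-level cases for "default-namespace" in the
# view-version JSON: field absent (null), present but empty, and non-empty.
# The parsing below mirrors option 3: [] is accepted as the empty namespace.

def parse_default_namespace(view_version_json):
    if "default-namespace" not in view_version_json:
        return None          # absent: no default namespace at all
    parts = view_version_json["default-namespace"]
    return tuple(parts)      # () is the empty namespace, distinct from None

assert parse_default_namespace({}) is None
assert parse_default_namespace({"default-namespace": []}) == ()
assert parse_default_namespace({"default-namespace": ["spark_demo"]}) == ("spark_demo",)
```

Under options 1 or 2, the empty-tuple case would instead be rejected or collapsed into None.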
nastra commented 15 hours ago

When creating views, spark creates Metadata objects that specify the "default-namespace" field but as an empty vec

Is this with plain OSS Spark or are you using the iceberg-spark-runtime.jar from Iceberg? Just asking, because we're properly handling the namespace when creating a view in https://github.com/apache/iceberg/blob/2a39af894f4f00aa37922ef765cc2583517fa1d1/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateV2ViewExec.scala#L51. This code then calls https://github.com/apache/iceberg/blob/81b3310ab469408022cc14af51257b7e8b36614f/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L593.

It would be helpful to understand which Spark + Iceberg versions you're using and how you're creating the view in Spark.

twuebi commented 13 hours ago

Hi @nastra,

this script:

import pyspark
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pandas as pd

CATALOG_URL = "http://server:8080/catalog"
MANAGEMENT_URL = "http://server:8080/management"
DEMO_WAREHOUSE = "demo"

config = {
    "spark.sql.catalog.demo-catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.demo-catalog.type": "rest",
    "spark.sql.catalog.demo-catalog.uri": CATALOG_URL,
    "spark.sql.catalog.demo-catalog.warehouse": DEMO_WAREHOUSE,
    "spark.sql.catalog.demo-catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": "demo-catalog",
    "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2",
}

spark_config = SparkConf().setMaster('local').setAppName("Iceberg-REST")
for k, v in config.items():
    spark_config = spark_config.set(k, v)

spark = SparkSession.builder.config(conf=spark_config).getOrCreate()
spark.sql("CREATE NAMESPACE IF NOT EXISTS spark_demo")
data = pd.DataFrame([[1, 'a-string', 2.2]], columns=['id', 'strings', 'floats'])
sdf = spark.createDataFrame(data)
sdf.writeTo("spark_demo.my_table").createOrReplace()
spark.sql("CREATE view spark_demo.vv as select * from spark_demo.my_table")

ends up sending this POST to /catalog/v1/df910b7e-3912-11ef-a048-6b2680efd54d/namespaces/spark_demo/views:

{
  "name": "vv",
  "schema": {
    "schema-id": 0,
    "type": "struct",
    "fields": [
      {
        "id": 0,
        "name": "id",
        "required": false,
        "type": "long"
      },
      {
        "id": 1,
        "name": "strings",
        "required": false,
        "type": "string"
      },
      {
        "id": 2,
        "name": "floats",
        "required": false,
        "type": "double"
      }
    ]
  },
  "view-version": {
    "version-id": 1,
    "schema-id": 0,
    "timestamp-ms": 1719994069884,
    "summary": {
      "app-id": "local-1719994018459",
      "engine-name": "spark",
      "iceberg-version": "Apache Iceberg 1.5.2 (commit cbb853073e681b4075d7c8707610dceecbee3a82)",
      "engine-version": "3.5.1"
    },
    "representations": [
      {
        "type": "sql",
        "sql": "select * from spark_demo.my_table",
        "dialect": "spark"
      }
    ],
    "default-namespace": []
  },
  "properties": {
    "create_engine_version": "Spark 3.5.1",
    "spark.query-column-names": "id,strings,floats",
    "engine_version": "Spark 3.5.1"
  }
}

twuebi commented 13 hours ago

Setting the namespace via:

spark.sql("USE spark_demo")
spark.sql("CREATE view vv3 as select * from spark_demo.my_table")

ends up setting default-namespace:

{
  "name": "vv3",
  "schema": {
    "schema-id": 0,
    "type": "struct",
    "fields": [
      {
        "id": 0,
        "name": "id",
        "required": false,
        "type": "long"
      },
      {
        "id": 1,
        "name": "strings",
        "required": false,
        "type": "string"
      },
      {
        "id": 2,
        "name": "floats",
        "required": false,
        "type": "double"
      }
    ]
  },
  "view-version": {
    "version-id": 1,
    "schema-id": 0,
    "timestamp-ms": 1719994988301,
    "summary": {
      "engine-version": "3.5.1",
      "app-id": "local-1719994018459",
      "engine-name": "spark",
      "iceberg-version": "Apache Iceberg 1.5.2 (commit cbb853073e681b4075d7c8707610dceecbee3a82)"
    },
    "representations": [
      {
        "type": "sql",
        "sql": "select * from spark_demo.my_table",
        "dialect": "spark"
      }
    ],
    "default-namespace": [
      "spark_demo"
    ]
  },
  "properties": {
    "spark.query-column-names": "id,strings,floats",
    "engine_version": "Spark 3.5.1",
    "create_engine_version": "Spark 3.5.1"
  }
}

nastra commented 12 hours ago

"default-namespace": [] indicates an empty namespace, not a null one, which is a valid case and there are tests for that here: https://github.com/apache/iceberg/blob/42a2c19cec31c626cbff6cc2dfafb86cdf223bd0/core/src/test/java/org/apache/iceberg/view/TestViewMetadataParser.java#L56. I've also opened https://github.com/apache/iceberg/pull/9890 a while ago to properly test empty namespaces with catalogs that support it.

However, why you're getting an empty namespace when creating the view with a qualified namespace is a different thing, which I need to look into and reproduce on my end (I'll try to do that before the end of the week).

nastra commented 11 hours ago

Ok, I checked the surrounding code and the handling is correct. In https://github.com/apache/iceberg/blob/2a39af894f4f00aa37922ef765cc2583517fa1d1/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateV2ViewExec.scala#L51 we're always determining the current namespace (which can be empty), which is then what ViewVersion#defaultNamespace is set to. As I mentioned earlier, there's a difference between a null namespace and an empty one.

Spark later uses this info in https://github.com/apache/iceberg/blob/6bbf70a52ebccfaba4e7e08facd72b84b571e2a6/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkView.java#L75 in case there's no namespace in the underlying SQL; in your case, you fully qualified select * from spark_demo.my_table with a namespace, so the default was never needed.
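That fallback can be sketched roughly as follows (illustrative Python, not the SparkView code; qualify is a made-up helper):

```python
# Rough sketch of how an engine can use "default-namespace" at query time:
# only identifiers that are not already qualified in the view SQL get the
# default namespace prepended.

def qualify(identifier_parts, default_namespace):
    if len(identifier_parts) > 1:
        return identifier_parts                      # already qualified in SQL
    return list(default_namespace) + identifier_parts

# Qualified reference: the (possibly empty) default namespace is ignored.
assert qualify(["spark_demo", "my_table"], []) == ["spark_demo", "my_table"]
# Unqualified reference: the default namespace fills the gap.
assert qualify(["my_table"], ["spark_demo"]) == ["spark_demo", "my_table"]
```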

In your second example you configured the current namespace via USE <namespace>, which was then evaluated in https://github.com/apache/iceberg/blob/2a39af894f4f00aa37922ef765cc2583517fa1d1/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateV2ViewExec.scala#L51 and passed to ViewVersion#defaultNamespace (but not used when actually querying the view).