Open c-thiel opened 2 months ago
Thanks! Nice work! @c-thiel
@ZENOTME one feature I did not implement yet is respecting "version.history.num-entries" as mentioned in the View Spec. I noticed that the table implementation also doesn't implement it. I don't think it matters much unless one is implementing a catalog, but I wanted to mention it.
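For context, a minimal sketch of what respecting that property could look like, assuming the default of 10 entries used by the Java implementation (the function name and dict-based metadata here are illustrative, not iceberg-rust API):

```python
# Illustrative sketch: when appending a new view version, trim the retained
# history to "version.history.num-entries". Names and shapes are hypothetical.

def append_version(version_log, new_entry, properties):
    # Assumption: 10 is the default, matching Java's ViewProperties.
    max_entries = int(properties.get("version.history.num-entries", 10))
    version_log = version_log + [new_entry]
    # Keep only the most recent max_entries versions.
    return version_log[-max_entries:]

log = append_version(
    [{"version-id": i} for i in range(1, 12)],   # 11 existing entries
    {"version-id": 12},
    {"version.history.num-entries": "10"},
)
print(len(log))  # 10: the oldest entries were dropped
```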
Thanks for working on this @c-thiel. Out of curiosity, and probably this is a broader discussion, but currently, the view spec is around SQL representation. This might not be the most natural fit with dataframe-based systems. Do you have any thoughts on how to represent views in iceberg-rust?
@Fokko that is a hard topic.
The idealist in me would like to eventually see something like Substrait being used. Adoption of the project across engines is very slow, though.
Maybe the more practical short-term solution would be to stick with SQL representations and even limit them to a single dialect - that of the "writer", including its technology. Each client can then decide whether a transpiler exists that is good enough for that dialect. Funnily enough, exactly this ambiguity - that there could be different dialects among the representations (say Spark and Trino) - gave me a lot of headaches. Which should I use if I am DataFusion? From which should I transpile? What do I do if they are not identical?
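The dilemma above can be sketched as a client picking from the stored representations by an ordered dialect preference; the function below is a hypothetical illustration, not part of iceberg-rust:

```python
# Hypothetical sketch: a reader with an ordered list of dialects it can
# parse (or transpile from) picks a SQL representation from a view version.

def pick_representation(representations, preferred_dialects):
    """Return the first SQL representation matching a preferred dialect,
    or None if no preferred dialect is available."""
    by_dialect = {r["dialect"]: r for r in representations if r["type"] == "sql"}
    for dialect in preferred_dialects:
        if dialect in by_dialect:
            return by_dialect[dialect]
    # No match: the caller must transpile from an arbitrary dialect or fail.
    return None

reps = [
    {"type": "sql", "dialect": "spark", "sql": "SELECT * FROM t"},
    {"type": "sql", "dialect": "trino", "sql": "SELECT * FROM t"},
]
chosen = pick_representation(reps, ["datafusion", "trino"])
print(chosen["dialect"])  # trino: best available, still needs transpiling
```

Even this simple policy leaves the hard questions open: the order of `preferred_dialects` is engine-specific, and nothing guarantees the two representations are semantically identical.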
Sticking with multiple representations is, in my opinion, only an option if we let the catalog handle this complexity. Unless Substrait gets a significant boost, I would assume that the following approach presents the most stable API (assuming there is only the REST Catalog left ;) ):
This way a REST Catalog could even switch to Substrait internally in the future, starting now by storing the presented SQL "as-is" without being able to serve any other dialect.
With SQLMesh coming up and Substrait continuing to evolve, there is at least some development in the area.
@Fokko from my side this is good to merge - types are complete and tests are passing.
@Fokko @ZENOTME are there any points open from your side that prevent us from merging the View Spec? If so, please let me know :)
The idealist in me would like to eventually see something like Substrait being used.
There is an open discussion on evolving the spec into something like that. I think transpiling SQL will never fully work in the long run because there are so many variations of and extensions to SQL dialects (custom UDFs and such); maybe you can reach 90% of the SQL, which may be good enough.
@nastra would you have time to go over this PR?
@nastra, @Fokko during testing we found a problem with the "default-namespace" field. I am hoping for some insights from your side:
According to the iceberg view spec, "default-namespace" is required: https://iceberg.apache.org/view-spec/#versions.
As such, it is modeled as NamespaceIdent:
https://github.com/c-thiel/iceberg-rust/blob/d8a7aabce73888e5b22712d60ca95b856db66fc2/crates/iceberg/src/spec/view_version.rs#L56
Valid namespace identifiers have at least one element, which is validated by the NamespaceIdent constructor.
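The invariant can be illustrated with a Python analogue of that constructor check (this is a sketch of the behavior described above, not the Rust code):

```python
# Illustrative analogue of the NamespaceIdent invariant: an identifier
# must have at least one non-empty level.

class NamespaceIdent:
    def __init__(self, parts):
        if not parts:
            raise ValueError("namespace must have at least one level")
        if any(not p for p in parts):
            raise ValueError("namespace levels must be non-empty strings")
        self.parts = list(parts)

NamespaceIdent(["spark_demo"])   # valid
try:
    NamespaceIdent([])           # what Spark sends as "default-namespace": []
except ValueError as e:
    print(e)
```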
When creating views, Spark creates metadata objects that specify the "default-namespace" field, but as an empty vec. This is quite an unlucky situation, and I am not sure what the desired behavior is.
When creating views, spark creates Metadata objects that specify the "default-namespace" field but as an empty vec
Is this with plain OSS Spark or are you using the iceberg-spark-runtime.jar from Iceberg? Just asking, because we're properly handling the namespace when creating a view in https://github.com/apache/iceberg/blob/2a39af894f4f00aa37922ef765cc2583517fa1d1/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateV2ViewExec.scala#L51. This code then calls https://github.com/apache/iceberg/blob/81b3310ab469408022cc14af51257b7e8b36614f/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L593.
It would be helpful to understand which Spark + Iceberg versions you're using and how you're creating the view in Spark.
Hi @nastra,
this script:
import pyspark
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pandas as pd
CATALOG_URL = "http://server:8080/catalog"
MANAGEMENT_URL = "http://server:8080/management"
DEMO_WAREHOUSE = "demo"
config = {
"spark.sql.catalog.demo-catalog": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.demo-catalog.type": "rest",
"spark.sql.catalog.demo-catalog.uri": CATALOG_URL,
"spark.sql.catalog.demo-catalog.warehouse": DEMO_WAREHOUSE,
"spark.sql.catalog.demo-catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"spark.sql.defaultCatalog": "demo-catalog",
"spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2",
}
spark_config = SparkConf().setMaster('local').setAppName("Iceberg-REST")
for k, v in config.items():
    spark_config = spark_config.set(k, v)
spark = SparkSession.builder.config(conf=spark_config).getOrCreate()
spark.sql("CREATE NAMESPACE IF NOT EXISTS spark_demo")
data = pd.DataFrame([[1, 'a-string', 2.2]], columns=['id', 'strings', 'floats'])
sdf = spark.createDataFrame(data)
sdf.writeTo("spark_demo.my_table").createOrReplace()
spark.sql("CREATE view spark_demo.vv as select * from spark_demo.my_table")
ends up sending this POST to /catalog/v1/df910b7e-3912-11ef-a048-6b2680efd54d/namespaces/spark_demo/views:
{
"name": "vv",
"schema": {
"schema-id": 0,
"type": "struct",
"fields": [
{
"id": 0,
"name": "id",
"required": false,
"type": "long"
},
{
"id": 1,
"name": "strings",
"required": false,
"type": "string"
},
{
"id": 2,
"name": "floats",
"required": false,
"type": "double"
}
]
},
"view-version": {
"version-id": 1,
"schema-id": 0,
"timestamp-ms": 1719994069884,
"summary": {
"app-id": "local-1719994018459",
"engine-name": "spark",
"iceberg-version": "Apache Iceberg 1.5.2 (commit cbb853073e681b4075d7c8707610dceecbee3a82)",
"engine-version": "3.5.1"
},
"representations": [
{
"type": "sql",
"sql": "select * from spark_demo.my_table",
"dialect": "spark"
}
],
"default-namespace": []
},
"properties": {
"create_engine_version": "Spark 3.5.1",
"spark.query-column-names": "id,strings,floats",
"engine_version": "Spark 3.5.1"
}
}
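To make the failure mode concrete: a deserializer that enforces the non-empty invariant rejects exactly this payload. A minimal sketch (the validation function is illustrative, not iceberg-rust code):

```python
import json

# Illustrative strict check mirroring a non-empty NamespaceIdent invariant.
def parse_default_namespace(view_version_json):
    ns = json.loads(view_version_json)["default-namespace"]
    if len(ns) == 0:
        raise ValueError("default-namespace must have at least one level")
    return ns

parse_default_namespace('{"default-namespace": ["spark_demo"]}')  # ok
try:
    # The payload above: the field is present, but an empty list.
    parse_default_namespace('{"default-namespace": []}')
except ValueError as e:
    print(e)
```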
Setting the namespace via:
spark.sql("USE spark_demo")
spark.sql("CREATE view vv3 as select * from spark_demo.my_table")
ends up setting default-namespace:
{
"name": "vv3",
"schema": {
"schema-id": 0,
"type": "struct",
"fields": [
{
"id": 0,
"name": "id",
"required": false,
"type": "long"
},
{
"id": 1,
"name": "strings",
"required": false,
"type": "string"
},
{
"id": 2,
"name": "floats",
"required": false,
"type": "double"
}
]
},
"view-version": {
"version-id": 1,
"schema-id": 0,
"timestamp-ms": 1719994988301,
"summary": {
"engine-version": "3.5.1",
"app-id": "local-1719994018459",
"engine-name": "spark",
"iceberg-version": "Apache Iceberg 1.5.2 (commit cbb853073e681b4075d7c8707610dceecbee3a82)"
},
"representations": [
{
"type": "sql",
"sql": "select * from spark_demo.my_table",
"dialect": "spark"
}
],
"default-namespace": [
"spark_demo"
]
},
"properties": {
"spark.query-column-names": "id,strings,floats",
"engine_version": "Spark 3.5.1",
"create_engine_version": "Spark 3.5.1"
}
}
"default-namespace": [] indicates an empty namespace, not a null one, which is a valid case; there are tests for that here: https://github.com/apache/iceberg/blob/42a2c19cec31c626cbff6cc2dfafb86cdf223bd0/core/src/test/java/org/apache/iceberg/view/TestViewMetadataParser.java#L56. I also opened https://github.com/apache/iceberg/pull/9890 a while ago to properly test empty namespaces with catalogs that support them.
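The null-vs-empty distinction being drawn here is purely a JSON one, which a short sketch makes explicit:

```python
import json

# Three distinct cases for "default-namespace" in view-version JSON:
v_missing = json.loads('{}')                                  # null: field absent
v_empty = json.loads('{"default-namespace": []}')             # empty namespace
v_set = json.loads('{"default-namespace": ["spark_demo"]}')   # populated

print("default-namespace" in v_missing)  # False
print(v_empty["default-namespace"])      # []
print(v_set["default-namespace"])        # ['spark_demo']
```

A model that can only represent the populated case conflates the first two, which is the root of the mismatch discussed in this thread.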
However, why you're getting an empty namespace when creating the view with a qualified namespace is a different thing, which I need to look into and reproduce on my end - I'll try to do that before the end of the week.
Ok, I checked the surrounding code and the handling is correct. In https://github.com/apache/iceberg/blob/2a39af894f4f00aa37922ef765cc2583517fa1d1/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateV2ViewExec.scala#L51 we're always determining the current namespace (which can be empty), which is then what ViewVersion#defaultNamespace is set to. As I mentioned earlier, there's a difference between a null namespace and an empty one.
Spark later uses this info in https://github.com/apache/iceberg/blob/6bbf70a52ebccfaba4e7e08facd72b84b571e2a6/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkView.java#L75 in case there's no namespace in the underlying SQL; in your case you fully qualified select * from spark_demo.my_table with a namespace.
In your second example you configured the current namespace via USE <namespace>, which was then evaluated in https://github.com/apache/iceberg/blob/2a39af894f4f00aa37922ef765cc2583517fa1d1/spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateV2ViewExec.scala#L51 and passed to ViewVersion#defaultNamespace (but not used when actually querying the view).
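The resolution rule described above can be sketched as follows; the helper is hypothetical and only illustrates when the default namespace is consulted:

```python
# Hedged sketch of the rule: default-namespace only matters for
# unqualified table references inside the view SQL.

def resolve_table(reference, default_namespace):
    parts = reference.split(".")
    if len(parts) > 1:
        # Fully qualified reference: default-namespace is ignored.
        return parts
    # Unqualified reference: prepend the view version's default namespace.
    return list(default_namespace) + parts

print(resolve_table("spark_demo.my_table", []))       # ['spark_demo', 'my_table']
print(resolve_table("my_table", ["spark_demo"]))      # ['spark_demo', 'my_table']
```

This matches the behavior observed in the thread: the first example's SQL was fully qualified, so the empty default namespace was never needed at query time.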
Implementation of the Iceberg V1 ViewMetadata