Documentation is missing, but a micro-batch sink is available for Spark Structured Streaming, so you can just write directly instead of overwriting the table (which would mean rewriting all records per batch).
This is the Python code I'm experimenting with for Iceberg. I wrote it in Python just to avoid long compilation times - there's nothing specific to Python/PySpark, so you can do the same with Scala as well.
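For reference, here's a minimal sketch of that approach in PySpark (not the exact code from this comment; the broker, topic, and HDFS paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-iceberg").getOrCreate()

# Consume from Kafka (broker and topic are placeholders).
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value"))

# Append each micro-batch directly to the Iceberg table; no overwrite,
# so only the new records of each batch are written.
query = (df.writeStream
         .format("iceberg")
         .outputMode("append")
         .option("path", "hdfs://nn:8020/path/to/table")
         .option("checkpointLocation", "hdfs://nn:8020/path/to/checkpoint")
         .start())

query.awaitTermination()
```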
@HeartSaVioR Thanks for your reply. Maybe there's something wrong with the wording of my issue title.
I have tested some differences between spark.table("prod.db.table") and spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table") in Spark Structured Streaming, and I wonder why they behave differently.
Sorry, I have no idea about that case. I'm also just starting to explore the project.
While analyzing Iceberg's catalog code, I found that there is still an open question here, and I have made some new discoveries:
When the table is loaded via spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table"), Iceberg does not use the Iceberg Catalog, so the table's metadata is not cached. Instead, the table is obtained directly through IcebergSource.findTable(options, conf).
However, when the table is loaded using spark.table("prod.db.table"), the CachingCatalog (cache-enabled defaults to true) automatically looks the table up in its cache (a Caffeine cache).
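A sketch to illustrate the two load paths, assuming a catalog named prod backed by Iceberg's SparkCatalog (the cache-enabled property below is the same flag mentioned above):

```python
from pyspark.sql import SparkSession

# A session with an Iceberg catalog named "prod" (catalog name and type are
# assumptions). Setting cache-enabled=false turns off the CachingCatalog
# wrapper, so spark.table() resolves fresh metadata on every query.
spark = (SparkSession.builder
         .config("spark.sql.catalog.prod", "org.apache.iceberg.spark.SparkCatalog")
         .config("spark.sql.catalog.prod.type", "hive")
         .config("spark.sql.catalog.prod.cache-enabled", "false")
         .getOrCreate())

# Path-based load: resolved directly via IcebergSource.findTable(options, conf),
# bypassing the catalog (and therefore its cache) entirely.
fresh_df = spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table")

# Catalog-based load: with cache-enabled=true (the default), this would return
# the Table instance held by CachingCatalog in its Caffeine cache.
cached_df = spark.table("prod.db.table")
```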
Finally, doesn't that make the documentation's description of this behavior incorrect? Shouldn't the correct description be something like this?
Using spark.table("prod.db.table") loads an isolated table reference that is not refreshed when other queries update the table.
@rdblue What do you think of this description? Should we update this part of the docs?
At least I can close the current issue now.
I ran a test that consumes Kafka messages and writes them to an Iceberg table with Spark Structured Streaming, and I'm having some trouble.
1. My environment
2. Create Iceberg table
3. The pseudocode is as follows (see the sketch after this list)
4. Kafka message
5. Result
- Case one: read the table with spark.table("prod.db.table"). Result: the read is wrong (the table appears empty) from the second batch onward.
- Case two: read the table with spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table"). Result: normal.
6. Question
The phenomenon is that spark.table("prod.db.table") does not refresh the Iceberg table for the next batch, while spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table") does the opposite and refreshes automatically. Is there a difference between spark.table("prod.db.table") and spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table")? I'm not sure if I'm using it the wrong way.
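Since the original pseudocode isn't preserved here, this is a rough reconstruction of the test under the setup above (broker, topic, paths, and column handling are assumptions):

```python
# Rough reconstruction (placeholders throughout; assumes a SparkSession
# `spark` with the "prod" catalog configured and the Iceberg table created).
def process_batch(batch_df, batch_id):
    # Append the current micro-batch to the Iceberg table.
    (batch_df.write
     .format("iceberg")
     .mode("append")
     .save("hdfs://nn:8020/path/to/table"))

    # Case one: catalog-based read; appeared empty from the second batch on.
    spark.table("prod.db.table").show()

    # Case two: path-based read; always showed the latest snapshot.
    spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table").show()

query = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "events")
         .load()
         .selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .foreachBatch(process_batch)
         .option("checkpointLocation", "hdfs://nn:8020/path/to/checkpoint")
         .start())

query.awaitTermination()
```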
Link: https://iceberg.apache.org/spark/#querying-with-dataframes