
Snowflake Iceberg Partitioned data read issue #9404

Open purna344 opened 9 months ago

purna344 commented 9 months ago

Feature Request / Improvement

We are using Snowflake Iceberg to read data from an S3 location, and that works fine for non-partitioned data.

But if the data is partitioned and the partitions are stored in a legacy HDFS-style layout like test3/country=USA/part-*1.c000.zstd.parquet, where the partition column "country" and its value "USA" appear only in the file path and the column is not stored inside the Parquet file, then Snowflake Iceberg is unable to read the data, whereas other frameworks such as Spark and Databricks can read it. We contacted the Snowflake team, and they said Iceberg cannot recognize the partition-column information in the folder path and expects the partition-column values inside the Parquet file.

We tried reading the same data using Apache Spark with Iceberg and it works; if we access the same data through Snowflake Iceberg, it fails and cannot recognize the partition details.

What changes need to be made on the application side (e.g. any config settings) to support this folder layout through Snowflake Iceberg?

Please let us know how to fix this issue.

Query engine

None

amogh-jahagirdar commented 9 months ago

I recommend continuing to reach out to Snowflake for any issues you are encountering with their Iceberg integration, but the Spark behavior in the reported issue does seem really odd to me from an Iceberg perspective.

Ultimately, in Iceberg the source of truth for partitioning is the table's partition spec. The advantage of decoupling logical partitioning from the physical organization of files is that the partitioning can be evolved safely and correctly as your data and query patterns change. Hive-style partitioning in the path is irrelevant to Iceberg for partition pruning and other planning-related operations.
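For illustration, a minimal sketch of declaring partitioning through Iceberg itself, so the spec rather than the folder layout drives pruning. This assumes a Spark session with an Iceberg catalog already configured; the names `my_catalog` and `db.events` are made up for the example, and the ALTER statement needs the Iceberg SQL extensions enabled:

```python
# Illustrative names: my_catalog, db.events. Assumes an Iceberg catalog
# is configured on the Spark session.
spark.sql("""
    CREATE TABLE my_catalog.db.events (
        id      BIGINT,
        country STRING,
        payload STRING)
    USING iceberg
    PARTITIONED BY (country)
""")

# The spec can later evolve without rewriting existing files, because
# Iceberg tracks partition values in table metadata, not in file paths.
spark.sql("ALTER TABLE my_catalog.db.events ADD PARTITION FIELD bucket(16, id)")
```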

You mentioned: "We tried reading the same data using Apache Spark with Iceberg and it works."

When you say "it's working", are you querying with partition predicates and seeing partitions being pruned? Do you have a partition spec defined in Iceberg, or do you only have Hive-style partitions? Without defining partitioning through Iceberg, I highly doubt any partition pruning is happening (because of the source-of-truth point above).
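(One quick way to check, sketched against the hypothetical table above: run a query with a partition predicate and inspect the physical plan; with a proper Iceberg spec, the scan should show the filter pushed down rather than a full scan.)

```python
# If Iceberg's partition spec is in effect, the scan node in the plan
# should report the pushed filter on `country` and plan only matching files.
spark.sql(
    "SELECT count(*) FROM my_catalog.db.events WHERE country = 'USA'"
).explain()
```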

Could you also share your Spark configs (redact any data that should be hidden)?

But as mentioned before, for any vendor-related integrations with Iceberg, I recommend reaching out to the vendor.

rakesh-das08 commented 9 months ago

We also ran into a version of this issue: when we create a new Iceberg table using Snowflake as the catalog, following the syntax documented here: https://docs.snowflake.com/en/sql-reference/sql/create-iceberg-table#snowflake-as-the-iceberg-catalog, the columns we define in CLUSTER BY (let's say month) are not actually reflected in the folder structure. We have reached out to the Snowflake team about this as well.

purna344 commented 9 months ago

If the producers write the data to storage with the config below set:

spark.conf.set("spark.databricks.delta.writePartitionColumnsToParquet", "false")

then the *.parquet files do not contain the partition columns, and the partition values are stored only in the file path. It is not feasible for us to ask producers not to set this config in their Spark jobs before publishing the data. I have heard that the Iceberg format expects the partition values inside the Parquet file. How can this scenario be handled, and does Iceberg support any config parameter for reading the partition values from the folder path? CC: @amogh-jahagirdar
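(For context: Iceberg does not read partition values from the folder path at query time, but Iceberg's Spark add_files procedure can parse them from a Hive-style path once, at registration, and record them in table metadata. A hedged sketch, with made-up catalog, table, and path names:)

```python
# Sketch: register existing Hive-style partitioned Parquet files into an
# existing Iceberg table. Partition values (e.g. country=USA) are parsed
# from the directory names during this call and stored in Iceberg's
# metadata, so readers never need them inside the Parquet files.
# Names below (spark_catalog, db.events, the S3 path) are illustrative.
spark.sql("""
    CALL spark_catalog.system.add_files(
        table => 'db.events',
        source_table => '`parquet`.`s3://my-bucket/test3`'
    )
""")
```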

zhongyujiang commented 9 months ago

@purna344 I have used the migrate procedure to migrate partitioned Parquet tables to Iceberg tables (with no partition columns in the Parquet files either). In my experience, Spark can handle Parquet files without partition columns as long as the files' partition metadata is correct in Iceberg's metadata. This is because the Spark reader infers the constant partition values from the partition metadata (not from the folder path), so it never actually needs to read the partition columns. I am not sure whether this is the case for your table; you can use the files metadata table to check whether a file's partition metadata is correct.
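A sketch of both steps using Iceberg's Spark procedures (table name illustrative; migrate assumes the source is an existing Spark/Hive Parquet table in the session catalog):

```python
# Migrate a Hive-style partitioned Parquet table in place; Iceberg records
# each data file's partition values in its own metadata.
spark.sql("CALL spark_catalog.system.migrate('db.events')")

# Inspect the files metadata table: the `partition` column holds the
# constant values the reader will substitute for partition columns that
# are missing from the Parquet files themselves.
spark.sql("SELECT file_path, partition FROM db.events.files").show(truncate=False)
```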

zhongyujiang commented 9 months ago

https://github.com/apache/iceberg/blob/31e31fd819c846f49d2bd459b8bfadfdc3c2bc3a/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/BaseReader.java#L201-L208

sfc-gh-rortloff commented 6 months ago

@purna344 this issue has been addressed in a recent build. Let me know if it is not working for you.

findinpath commented 6 months ago

@sfc-gh-rortloff I went through the Snowflake documentation https://docs.snowflake.com/en/sql-reference/sql/create-iceberg-table and don't see any reference to partitioning.

Could you please sketch here how to create a partitioned Iceberg table via Snowflake SQL syntax?

tnatssb commented 5 months ago

@sfc-gh-rortloff are there plans for Snowflake to support Iceberg partitions? This seems like a very basic feature that should be supported.