apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
https://xtable.apache.org/
Apache License 2.0

Creating Iceberg data in S3 with spark-sql (using the hive catalog type) doesn't generate version-hint.text #464

Closed alberttwong closed 4 months ago

alberttwong commented 5 months ago

Please describe the bug 🐞

I'm using Iceberg through spark-sql in the usual way. When I write Iceberg data, no version-hint.text file is generated.

Related: https://github.com/apache/incubator-xtable/discussions/463#discussioncomment-9706867

Using Iceberg with the hive catalog type:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2,org.apache.hadoop:hadoop-client:2.10.2,com.amazonaws:aws-java-sdk-s3:1.11.271,org.apache.hadoop:hadoop-aws:2.10.2 \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
    --conf spark.sql.defaultCatalog=iceberg \
    --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.iceberg.warehouse=s3://warehouse \
    --conf spark.sql.catalog.iceberg.type=hive
CREATE SCHEMA iceberg_db LOCATION 's3a://warehouse/';
CREATE TABLE iceberg_db.taxis 
(
  vendor_id bigint,
  trip_id bigint,
  trip_distance float,
  fare_amount double,
  store_and_fwd_flag string
)
PARTITIONED BY (vendor_id) ;
INSERT INTO iceberg_db.taxis VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y');

Running the XTable sync against that table then gives this error:

root@spark:/opt/xtable/jars# export AWS_SECRET_ACCESS_KEY=password
root@spark:/opt/xtable/jars# export AWS_ACCESS_KEY_ID=admin
root@spark:/opt/xtable/jars# export ENDPOINT=http://minio:9000
root@spark:/opt/xtable/jars# export AWS_REGION=us-east-1
root@spark:/opt/xtable/jars# cd /opt/xtable/jars/; java -jar xtable-utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig xtable_iceberg.yaml -p core-site.xml
WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.
2024-06-07 19:54:51 INFO  org.apache.xtable.utilities.RunSync:148 - Running sync for basePath s3a://warehouse/taxis for following table formats [HUDI, DELTA]
2024-06-07 19:54:51 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:133 - Loading HoodieTableMetaClient from s3a://warehouse/taxis
2024-06-07 19:54:51 WARN  org.apache.hadoop.util.NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-06-07 19:54:51 WARN  org.apache.hadoop.metrics2.impl.MetricsConfig:136 - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
2024-06-07 19:54:52 WARN  org.apache.hadoop.fs.s3a.SDKV2Upgrade:39 - Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.DefaultAWSCredentialsProviderChain. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
2024-06-07 19:54:52 INFO  org.apache.xtable.hudi.HudiTableManager:73 - Hudi table does not exist, will be created on first sync
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/xtable/jars/xtable-utilities-0.1.0-SNAPSHOT-bundled.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2024-06-07 19:54:53 INFO  org.apache.spark.sql.delta.storage.DelegatingLogStore:60 - LogStore `LogStoreAdapter(io.delta.storage.S3SingleDriverLogStore)` is used for scheme `s3a`
2024-06-07 19:54:53 INFO  org.apache.spark.sql.delta.DeltaLog:60 - Creating initial snapshot without metadata, because the directory is empty
2024-06-07 19:54:54 INFO  org.apache.spark.sql.delta.InitialSnapshot:60 - [tableId=95c3e958-7fec-4917-bb9d-28bdf4504d33] Created snapshot InitialSnapshot(path=s3a://warehouse/taxis/_delta_log, version=-1, metadata=Metadata(bd69b4e8-e7de-4df2-b8fe-6ada0e1d0cc8,null,null,Format(parquet,Map()),null,List(),Map(),Some(1717790094004)), logSegment=LogSegment(s3a://warehouse/taxis/_delta_log,-1,List(),None,-1), checksumOpt=None)
2024-06-07 19:54:54 INFO  org.apache.xtable.conversion.ConversionController:240 - No previous InternalTable sync for target. Falling back to snapshot sync.
2024-06-07 19:54:54 INFO  org.apache.xtable.conversion.ConversionController:240 - No previous InternalTable sync for target. Falling back to snapshot sync.
2024-06-07 19:54:54 WARN  org.apache.iceberg.hadoop.HadoopTableOperations:325 - Error reading version hint file s3a://warehouse/taxis/metadata/version-hint.text
java.io.FileNotFoundException: No such file or directory: s3a://warehouse/taxis/metadata/version-hint.text
        at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3801) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3652) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.s3a.S3AFileSystem.extractOrFetchSimpleFileStatus(S3AFileSystem.java:5288) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$executeOpen$6(S3AFileSystem.java:1578) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.s3a.S3AFileSystem.executeOpen(S3AFileSystem.java:1576) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1550) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:997) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.iceberg.hadoop.HadoopTableOperations.findVersion(HadoopTableOperations.java:318) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.iceberg.hadoop.HadoopTableOperations.refresh(HadoopTableOperations.java:104) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.iceberg.hadoop.HadoopTableOperations.current(HadoopTableOperations.java:84) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.iceberg.hadoop.HadoopTables.load(HadoopTables.java:94) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.iceberg.IcebergTableManager.lambda$getTable$1(IcebergTableManager.java:58) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at java.util.Optional.orElseGet(Unknown Source) [?:?]
        at org.apache.xtable.iceberg.IcebergTableManager.getTable(IcebergTableManager.java:58) [xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.iceberg.IcebergConversionSource.initSourceTable(IcebergConversionSource.java:81) [xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.iceberg.IcebergConversionSource.getSourceTable(IcebergConversionSource.java:60) [xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.iceberg.IcebergConversionSource.getCurrentSnapshot(IcebergConversionSource.java:121) [xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.spi.extractor.ExtractFromSource.extractSnapshot(ExtractFromSource.java:38) [xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.conversion.ConversionController.syncSnapshot(ConversionController.java:183) [xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.conversion.ConversionController.sync(ConversionController.java:121) [xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.utilities.RunSync.main(RunSync.java:169) [xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
2024-06-07 19:54:54 ERROR org.apache.xtable.utilities.RunSync:171 - Error running sync for s3a://warehouse/taxis
org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at location: s3a://warehouse/taxis
        at org.apache.iceberg.hadoop.HadoopTables.load(HadoopTables.java:97) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.iceberg.IcebergTableManager.lambda$getTable$1(IcebergTableManager.java:58) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at java.util.Optional.orElseGet(Unknown Source) ~[?:?]
        at org.apache.xtable.iceberg.IcebergTableManager.getTable(IcebergTableManager.java:58) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.iceberg.IcebergConversionSource.initSourceTable(IcebergConversionSource.java:81) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.iceberg.IcebergConversionSource.getSourceTable(IcebergConversionSource.java:60) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.iceberg.IcebergConversionSource.getCurrentSnapshot(IcebergConversionSource.java:121) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.spi.extractor.ExtractFromSource.extractSnapshot(ExtractFromSource.java:38) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.conversion.ConversionController.syncSnapshot(ConversionController.java:183) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.conversion.ConversionController.sync(ConversionController.java:121) ~[xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
        at org.apache.xtable.utilities.RunSync.main(RunSync.java:169) [xtable-utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
sh-5.1# mc alias set warehouse http://minio:9000 admin password
Added `warehouse` successfully.
sh-5.1# mc ls -r warehouse
[2024-06-07 20:01:59 UTC]     0B STANDARD warehouse/taxis/_delta_log/
[2024-06-07 20:00:59 UTC] 1.5KiB STANDARD warehouse/taxis/data/vendor_id=1/00000-10-e0fc3ef1-3606-4591-bcf6-d72b25747380-0-00001.parquet
[2024-06-07 19:59:51 UTC] 1.5KiB STANDARD warehouse/taxis/data/vendor_id=1/00000-5-865f6992-e612-49a0-a8db-a27fd0f7d02a-0-00001.parquet
[2024-06-07 20:00:59 UTC] 1.5KiB STANDARD warehouse/taxis/data/vendor_id=2/00000-10-e0fc3ef1-3606-4591-bcf6-d72b25747380-0-00002.parquet
[2024-06-07 19:59:51 UTC] 1.5KiB STANDARD warehouse/taxis/data/vendor_id=2/00000-5-865f6992-e612-49a0-a8db-a27fd0f7d02a-0-00002.parquet
[2024-06-07 19:59:24 UTC] 1.4KiB STANDARD warehouse/taxis/metadata/00000-77bdf818-507f-48f1-971a-1898c294bf49.metadata.json
[2024-06-07 19:59:51 UTC] 2.4KiB STANDARD warehouse/taxis/metadata/00001-ef7e1726-2ec0-4582-bef8-9c37f96e2909.metadata.json
[2024-06-07 20:00:59 UTC] 3.4KiB STANDARD warehouse/taxis/metadata/00002-15d50d43-f5d8-4faa-b2b7-f41aca3f758f.metadata.json
[2024-06-07 19:59:51 UTC] 7.0KiB STANDARD warehouse/taxis/metadata/ae7384d8-6b4a-4fd6-bbc2-e9b621ba9e0b-m0.avro
[2024-06-07 20:00:59 UTC] 7.0KiB STANDARD warehouse/taxis/metadata/ebef7e50-73aa-4428-98aa-6ad0a8ed7802-m0.avro
[2024-06-07 19:59:51 UTC] 4.1KiB STANDARD warehouse/taxis/metadata/snap-1202211864160811787-1-ae7384d8-6b4a-4fd6-bbc2-e9b621ba9e0b.avro
[2024-06-07 20:00:59 UTC] 4.2KiB STANDARD warehouse/taxis/metadata/snap-3391991307049980362-1-ebef7e50-73aa-4428-98aa-6ad0a8ed7802.avro
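
The listing above contains metadata.json, manifest, and snapshot files but no version-hint.text. A minimal sketch for checking this automatically; `has_version_hint` is a hypothetical helper name, and it only assumes that each listed object path appears at the end of a line, as in the `mc ls -r` output above:

```shell
# Hypothetical helper: reads an object listing on stdin and succeeds
# only if some line ends with metadata/version-hint.text.
has_version_hint() {
  grep -q 'metadata/version-hint\.text$'
}

# Usage against the MinIO alias from this report (requires the mc client):
#   mc ls -r warehouse | has_version_hint && echo "found" || echo "missing"
```

Run against the listing above, the check reports the file as missing, which matches the FileNotFoundException in the sync log.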

alberttwong commented 5 months ago

Maybe this is the hint: https://github.com/apache/incubator-xtable/issues/431#issuecomment-2104843042 (switching the catalog type from hive to hadoop).

dipankarmazumdar commented 4 months ago

@alberttwong - yeah, it looks like the same thing. You used a Hive catalog to create the Iceberg table, but there are no configs for it. So if you are not bound to Hive, you can use a file-system-based catalog such as Iceberg's Hadoop catalog.

alberttwong commented 4 months ago

The gist is that you have to use type=hadoop, or else Iceberg won't generate the version-hint.text file.
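
For reference, a sketch of the spark-sql launch from the original report with only the catalog type switched. It has not been run against this setup. The packages and credentials provider are copied from the report; the `s3a://warehouse` value is an assumption (the report used `s3://warehouse`), chosen to match the s3a paths XTable reads.

```shell
# Same launch command as in the report, except the warehouse scheme (assumed s3a)
# and the last line: type=hadoop instead of type=hive.
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.2,org.apache.iceberg:iceberg-aws-bundle:1.5.2,org.apache.hadoop:hadoop-client:2.10.2,com.amazonaws:aws-java-sdk-s3:1.11.271,org.apache.hadoop:hadoop-aws:2.10.2 \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
    --conf spark.sql.defaultCatalog=iceberg \
    --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.iceberg.warehouse=s3a://warehouse \
    --conf spark.sql.catalog.iceberg.type=hadoop
```

With the hive type, the pointer to the current metadata file is tracked in the metastore rather than in a version-hint.text file, which is why the path-based lookup in the stack trace above (HadoopTables.load) finds nothing. A Hadoop catalog typically lays tables out under `<warehouse>/<namespace>/<table>`, so the basePath in the XTable dataset config may need adjusting accordingly.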