
add_files with RestCatalog, S3FileIO #11558

Open DongSeungLee opened 1 week ago

DongSeungLee commented 1 week ago

Query engine

Spark 3.5.3

Question

For study purposes, I run a standalone Spark cluster on my local machine, and I have developed my own Iceberg REST catalog based on the Iceberg 1.6.1 spec. I am running the add_files procedure provided by Spark, like below.

CALL iceberg.system.add_files(
table => 'yearly_month_clicks',
source_table => '`parquet`.`s3a://dataquery-warehouse/iceberg/data`'
);

The following error occurs:

Caused by: org.apache.iceberg.exceptions.RuntimeIOException: Failed to get file system for path: s3://dataquery-warehouse/iceberg/dataquery/yearly_month_clicks/metadata/stage-31-task-1619-manifest-855c8009-c073-48b0-9fd7-e12c1daf8930.avro
    at org.apache.iceberg.hadoop.Util.getFs(Util.java:58)
    at org.apache.iceberg.hadoop.HadoopOutputFile.fromPath(HadoopOutputFile.java:53)
    at org.apache.iceberg.hadoop.HadoopFileIO.newOutputFile(HadoopFileIO.java:97)
    at org.apache.iceberg.spark.SparkTableUtil.buildManifest(SparkTableUtil.java:368)
    at org.apache.iceberg.spark.SparkTableUtil.lambda$importSparkPartitions$1e94a719$1(SparkTableUtil.java:796)
    at org.apache.spark.sql.Dataset.$anonfun$mapPartitions$1(Dataset.scala:3414)
    at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:198)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.iceberg.hadoop.Util.getFs(Util.java:56)

From my point of view, Spark tries to create the staging manifest under the location stored in the Iceberg table metadata. Here the metadata location starts with s3, and that scheme stays fixed as s3. Spark then tries to resolve a Hadoop FileSystem for the path, and since the s3 scheme is not supported, it fails; s3a would be the right scheme. How can I overcome this issue? Thanks, sincerely.
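
Not part of the original report, but a commonly used workaround for this exact symptom is to map the bare s3 scheme onto the S3A connector through Hadoop's fs.s3.impl setting, so that Util.getFs can resolve s3:// paths. A minimal, untested sketch; the app name, endpoint, and credential values are placeholders for a local setup:

// Untested sketch: map the bare "s3" scheme to the S3A connector so that
// Hadoop FileSystem lookups for s3:// paths resolve to S3AFileSystem.
// Endpoint and credential values below are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("add-files-s3-scheme-workaround")
  // Tell Hadoop to use the S3A implementation for s3:// URIs as well.
  .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  // The usual S3A settings still apply (placeholders).
  .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
  .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
  .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()

spark.sql("""
  CALL iceberg.system.add_files(
    table => 'yearly_month_clicks',
    source_table => '`parquet`.`s3a://dataquery-warehouse/iceberg/data`'
  )
""")

This only papers over the scheme mismatch; the dependency on Hadoop FileSystem classes itself is what the reply below addresses.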

RussellSpitzer commented 17 hours ago

This is actually related to #11541. add_files uses some Hadoop FileSystem classes under the hood, and because of this you currently must have a fully set up Hadoop configuration in your runtime to run add_files. With #11541 completed, we should be able to fix this for add_files and use S3FileIO instead of the Hadoop FileSystem classes.
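
To illustrate the current state described above (again a sketch, with a placeholder catalog name and URIs, not configuration from this thread): the REST catalog can hand out S3FileIO for normal table I/O, but the stack trace shows the add_files staging manifest being written through HadoopFileIO, so the Hadoop-side S3A settings from the earlier sketch still have to be present in the Spark runtime.

// Untested sketch: REST catalog configured with S3FileIO on the client side.
// add_files still builds its staging manifest via Hadoop FileSystem classes,
// so fs.s3.impl / fs.s3a.* (see the previous sketch) remain necessary for now.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rest-catalog-with-s3fileio")
  .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.iceberg.type", "rest")
  .config("spark.sql.catalog.iceberg.uri", "http://localhost:8181")
  .config("spark.sql.catalog.iceberg.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.sql.catalog.iceberg.s3.endpoint", "http://localhost:9000")
  .getOrCreate()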

DongSeungLee commented 12 hours ago

I appreciate your sincere answer.