
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

MinIO + Spark + hive metadata + iceberg format #10222

Closed rychu151 closed 5 months ago

rychu151 commented 6 months ago

Query engine

Spark

Question

I'm trying to set up a local development environment for my testing purposes using Docker.

The goal is to save a DataFrame in Iceberg format with Hive metadata.

Here is my current docker-compose:

version: "3"

services:

  #Jupyter Notebook with PySpark & iceberg Server
  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    build: spark/
    networks:
      iceberg_net:
    depends_on:
      #- rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
      - ./spark-iceberg/spark/jars/nessie-spark-extensions-3.5_2.12-0.80.0.jar:/opt/spark/jars/nessie-spark-extensions-3.5_2.12-0.80.0.jar
      - ./spark-iceberg/spark/conf/spark-defaults.conf:/opt/spark/conf/spark-defaults.conf
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - USE_STREAM_CAPABLE_STATE_STORE=true
      - CATALOG_WAREHOUSE=s3://warehouse/
    ports:
      - "8888:8888"
      - "8080:8080"
      - "10000:10000"
      - "10001:10001"

  # Minio Storage Server
  minio:
    image: bitnami/minio:latest # not minio/minio because of reported issues with the image
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_REGION=us-east-1
      - MINIO_REGION_NAME=us-east-1
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio
    ports:
      - "9001:9001"
      - "9000:9000"

  #hive metastore
  hive-metastore:
    image: apache/hive:4.0.0
    container_name: hive-metastore
    networks:
      iceberg_net:
    ports:
      - "9083:9083"
    environment:
        - SERVICE_NAME=metastore
    depends_on:
      - zookeeper
      - postgres
    volumes:
        - ./hive_metastore/conf/hive-site.xml:/opt/hive/conf/hive-site.xml
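
Note: the services above reference an iceberg_net network (and zookeeper/postgres services) that are presumably defined further down in the file and not shown here. For a self-contained sketch, the missing network definition would look roughly like:

networks:
  iceberg_net: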

spark-defaults.conf:

spark.sql.extensions                   org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.hive_prod            org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_prod.type       hive
spark.sql.catalog.hive_prod.uri        thrift://hive-metastore:9083

spark.sql.catalog.hive_prod.io-impl          org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.hive_prod.s3.endpoint      http://minio:9000
spark.sql.catalog.hive_prod.warehouse        s3://warehouse/
hive.metastore.uris                    thrift://hive-metastore:9083
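
One hedged note on the catalog settings (the lines below are not part of the original config, just a sketch of S3FileIO catalog properties): MinIO typically needs path-style addressing and explicit credentials, which in spark-defaults.conf form would be roughly:

spark.sql.catalog.hive_prod.s3.path-style-access     true
spark.sql.catalog.hive_prod.s3.access-key-id         admin
spark.sql.catalog.hive_prod.s3.secret-access-key     password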

and hive-site.xml:

<configuration>
    <property>
        <name>hive.server2.enable.doAs</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.tez.exec.inplace.progress</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.exec.scratchdir</name>
        <value>/opt/hive/scratch_dir</value>
    </property>
    <property>
        <name>hive.user.install.directory</name>
        <value>/opt/hive/install_dir</value>
    </property>
    <property>
        <name>tez.runtime.optimize.local.fetch</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.exec.submit.local.task.via.child</name>
        <value>false</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>local</value>
    </property>
    <property>
        <name>tez.local.mode</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.execution.engine</name>
        <value>tez</value>
    </property>
    <property>
        <name>metastore.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>s3a://warehouse/</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>http://localhost:9000</value>
    </property>
    <property>
        <name>fs.s3a.access.key</name>
        <value>admin</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>password</value>
    </property>
    <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
    </property>
    <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>
    <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.authorization.storage.checks</name>
        <value>false</value>
        <description>Disables storage-based authorization checks to allow Hive better compatibility with MinIO.
        </description>
    </property>

</configuration>

Using the MinIO UI I have created a bucket called warehouse and set it to public access.

The goal is to save a DataFrame in Iceberg format with Hive metadata so I will be able to browse this data using Apache Druid.

In order to create a table I use PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

col_name = "col_name"
label_name = "label"
data_name = "upload_date"

schema = StructType([
    StructField(data_name, LongType(), False),
    StructField(col_name, StringType(), False),
    StructField(label_name, StringType(), False)
])

spark = SparkSession.builder.appName("schema_example").enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.iceberg.catalog.hive_prod", "DEBUG")
spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")

data = []

df = spark.createDataFrame(data, schema)

spark.sql("SHOW DATABASES ").show() prints only default database

when i try to create a database like below: spark.sql('CREATE DATABASE IF NOT EXISTS hive_prod.testing')

i get the following error:

Py4JJavaError: An error occurred while calling o34.sql.
: java.lang.RuntimeException: Failed to create namespace testing in Hive Metastore
at org.apache.iceberg.hive.HiveCatalog.createNamespace(HiveCatalog.java:299)
Caused by: MetaException(message:Failed to create external path s3://warehouse/testing.db for database testing. This may result in access not being allowed if the StorageBasedAuthorizationProvider is enabled: null)

Does anyone understand why?

vinh22032000 commented 5 months ago

Hi, have you found a solution yet? I have the same problem when using Hive 4.0 with MinIO:

pyspark.errors.exceptions.captured.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Failed to create external path s3a://wba/warehouse/wba.db for database wba. This may result in access not being allowed if the StorageBasedAuthorizationProvider is enabled: null)

rychu151 commented 5 months ago

I gave up on using the Hive metastore. Nessie has no compatibility issues.
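
For anyone taking the same route, here is a minimal sketch of a Nessie-backed Iceberg catalog in spark-defaults.conf, assuming a Nessie server reachable at http://nessie:19120 and the nessie-spark-extensions jar already mounted as in the compose file above (the endpoint, ref, and catalog name are assumptions, not from this thread):

spark.sql.extensions                     org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions
spark.sql.catalog.nessie                 org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl    org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri             http://nessie:19120/api/v1
spark.sql.catalog.nessie.ref             main
spark.sql.catalog.nessie.warehouse       s3://warehouse/
spark.sql.catalog.nessie.io-impl         org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.nessie.s3.endpoint     http://minio:9000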

mustafaaykon commented 15 hours ago

Hi @rychu151, I saw you passed 'fs.s3a.endpoint' as localhost. I think it shouldn't be localhost, because Hive and MinIO are running in different containers. Did you try setting 'http://minio:9000' for the fs.s3a.endpoint parameter?
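
In hive-site.xml terms, that suggestion would look roughly like this (a sketch of the comment above, not a verified fix):

<property>
    <name>fs.s3a.endpoint</name>
    <value>http://minio:9000</value>
</property>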

xhuyvn commented 6 hours ago

> Hi, have you found a solution yet? I have the same problem when using Hive 4.0 with MinIO:
>
> pyspark.errors.exceptions.captured.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Failed to create external path s3a://wba/warehouse/wba.db for database wba. This may result in access not being allowed if the StorageBasedAuthorizationProvider is enabled: null)

I had the same issue and solved it by adding this config in metastore-site.xml:

<property>
    <name>hive.metastore.pre.event.listeners</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
</property>
<property>
    <name>hive.security.metastore.authorization.manager</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
</property>