apache / incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.
https://xtable.apache.org/
Apache License 2.0

OneTable with MinIO Buckets #327

Closed. soumilshah1995 closed this issue 7 months ago

soumilshah1995 commented 9 months ago

Hello there, I have a local MinIO bucket named huditest (screenshot attached).

I am trying to use OneTable with MinIO.

docker-compose file

version: "3"

services:

  metastore_db:
    image: postgres:11
    hostname: metastore_db
    ports:
      - 5432:5432
    environment:
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive
      POSTGRES_DB: metastore

  hive-metastore:
    hostname: hive-metastore
    image: 'starburstdata/hive:3.1.2-e.18'
    ports:
      - '9083:9083' # Metastore Thrift
    environment:
      HIVE_METASTORE_DRIVER: org.postgresql.Driver
      HIVE_METASTORE_JDBC_URL: jdbc:postgresql://metastore_db:5432/metastore
      HIVE_METASTORE_USER: hive
      HIVE_METASTORE_PASSWORD: hive
      HIVE_METASTORE_WAREHOUSE_DIR: s3://datalake/
      S3_ENDPOINT: http://minio:9000
      S3_ACCESS_KEY: admin
      S3_SECRET_KEY: password
      S3_PATH_STYLE_ACCESS: "true"
      REGION: ""
      GOOGLE_CLOUD_KEY_FILE_PATH: ""
      AZURE_ADL_CLIENT_ID: ""
      AZURE_ADL_CREDENTIAL: ""
      AZURE_ADL_REFRESH_URL: ""
      AZURE_ABFS_STORAGE_ACCOUNT: ""
      AZURE_ABFS_ACCESS_KEY: ""
      AZURE_WASB_STORAGE_ACCOUNT: ""
      AZURE_ABFS_OAUTH: ""
      AZURE_ABFS_OAUTH_TOKEN_PROVIDER: ""
      AZURE_ABFS_OAUTH_CLIENT_ID: ""
      AZURE_ABFS_OAUTH_SECRET: ""
      AZURE_ABFS_OAUTH_ENDPOINT: ""
      AZURE_WASB_ACCESS_KEY: ""
      HIVE_METASTORE_USERS_IN_ADMIN_ROLE: "admin"
    depends_on:
      - metastore_db
    healthcheck:
      test: bash -c "exec 6<> /dev/tcp/localhost/9083"

  minio:
    image: minio/minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      default:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]

  mc:
    depends_on:
      - minio
    image: minio/mc
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      tail -f /dev/null
      "

volumes:
  hive-metastore-postgresql:

networks:
  default:
    name: hudi

hudi_job.py

try:
    import os
    import sys
    import uuid
    import pyspark
    from pyspark.sql import SparkSession
    from pyspark import SparkConf, SparkContext
    from faker import Faker
    from datetime import datetime
    import random
    import pandas as pd  # Import Pandas library for pretty printing

    print("Imports loaded")

except Exception as e:
    print("error", e)

HUDI_VERSION = '0.14.0'
SPARK_VERSION = '3.4'

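# Pull in the Hudi Spark bundle and hadoop-aws so Spark can read and write s3a:// paths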
SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION},org.apache.hadoop:hadoop-aws:3.3.2 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable

spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
    .config('className', 'org.apache.hudi') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()

print(spark)
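# Point the S3A filesystem at the local MinIO endpoint: path-style access, no SSL, static admin credentials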
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "http://127.0.0.1:9000/")
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "admin")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "password")
spark._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "false")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
                                     "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")

global faker
faker = Faker()

def get_customer_data(total_customers=2):
    customers_array = []
    for i in range(0, total_customers):
        customer_data = {
            "customer_id": str(uuid.uuid4()),
            "name": faker.name(),
            "state": faker.state(),
            "city": faker.city(),
            "email": faker.email(),
            "created_at": datetime.now().isoformat().__str__(),
            "address": faker.address(),
            "salary": faker.random_int(min=30000, max=100000)
        }
        customers_array.append(customer_data)
    return customers_array

global total_customers, order_data_sample_size
total_customers = 5000
customer_data = get_customer_data(total_customers=total_customers)
spark_df_customers = spark.createDataFrame(data=[tuple(i.values()) for i in customer_data],
                                           schema=list(customer_data[0].keys()))
spark_df_customers.show(3)

def write_to_hudi(spark_df,
                  table_name,
                  db_name,
                  method='upsert',
                  table_type='COPY_ON_WRITE',
                  recordkey='',
                  precombine='',
                  partition_fields=''
                  ):
    path = f"s3a://huditest/hudi/database={db_name}/table_name={table_name}"
    # path = f"file:///Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/hudi/{db_name}/{table_name}"

    hudi_options = {
        'hoodie.table.name': table_name,
        'hoodie.datasource.write.table.type': table_type,
        'hoodie.datasource.write.table.name': table_name,
        'hoodie.datasource.write.operation': method,
        'hoodie.datasource.write.recordkey.field': recordkey,
        'hoodie.datasource.write.precombine.field': precombine,
        "hoodie.datasource.write.partitionpath.field": partition_fields,

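        # Sync the table into the Hive Metastore from docker-compose via thrift://localhost:9083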
        "hoodie.datasource.hive_sync.database": db_name,
        "hoodie.datasource.hive_sync.table": table_name,
        "hoodie.datasource.hive_sync.partition_fields": partition_fields,
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.metastore.uris": "thrift://localhost:9083",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.write.hive_style_partitioning": "true"
    }

    print("\n")
    print(path)
    print("\n")

    spark_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(path)

write_to_hudi(
    spark_df=spark_df_customers,
    db_name="default",
    table_name="customers",
    recordkey="customer_id",
    precombine="created_at",
    partition_fields="state"
)

config.yml

sourceFormat: HUDI
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: s3a://huditest/hudi/database=default/table_name=customers
    tableName: customers
    partitionSpec: state:VALUE

Error

soumilshah@Soumils-MacBook-Pro StarRocks-Hudi-Minio % java \
-jar  /Users/soumilshah/IdeaProjects/SparkProject/MyGIt/StarRocks-Hudi-Minio/jar/utilities-0.1.0-beta1-bundled.jar \
--datasetConfig ./my_config.yaml
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/Users/soumilshah/IdeaProjects/SparkProject/MyGIt/StarRocks-Hudi-Minio/jar/utilities-0.1.0-beta1-bundled.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-02-08 15:48:18 INFO  io.onetable.utilities.RunSync:141 - Running sync for basePath s3://huditest/hudi/database=default/table_name=customers/ for following table formats [DELTA]
2024-02-08 15:48:18 ERROR io.onetable.utilities.RunSync:164 - Error running sync for s3://huditest/hudi/database=default/table_name=customers/
org.apache.hudi.exception.HoodieIOException: Could not check if s3://huditest/hudi/database=default/table_name=customers is a valid table
    at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:59) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:140) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:692) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:85) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:774) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at io.onetable.hudi.HudiSourceClientProvider.getSourceClientInstance(HudiSourceClientProvider.java:42) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at io.onetable.hudi.HudiSourceClientProvider.getSourceClientInstance(HudiSourceClientProvider.java:31) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at io.onetable.client.OneTableClient.sync(OneTableClient.java:97) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at io.onetable.utilities.RunSync.main(RunSync.java:162) ~[utilities-0.1.0-beta1-bundled.jar:?]
Caused by: java.nio.file.AccessDeniedException: s3://huditest/hudi/database=default/table_name=customers/.hoodie: getFileStatus on s3://huditest/hudi/database=default/table_name=customers/.hoodie: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 8K4R3M1SP8D8VJYF; S3 Extended Request ID: t8D2C5IazpRmCbGGH0Hofx1FYf1KqHHZbgw5iSR/xWxw/bOt4f8iSbtszyGFP1PaPYss/Vi1XwdvMZX4wRIUrTk7RIZ5OUS3taz+jCMzLyk=; Proxy: null), S3 Extended Request ID: t8D2C5IazpRmCbGGH0Hofx1FYf1KqHHZbgw5iSR/xWxw/bOt4f8iSbtszyGFP1PaPYss/Vi1XwdvMZX4wRIUrTk7RIZ5OUS3taz+jCMzLyk=:403 Forbidden
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:230) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:151) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2275) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2226) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2160) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$getFileStatus$17(HoodieWrapperFileSystem.java:410) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:114) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.common.fs.HoodieWrapperFileSystem.getFileStatus(HoodieWrapperFileSystem.java:404) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:51) ~[utilities-0.1.0-beta1-bundled.jar:?]
    ... 8 more
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 8K4R3M1SP8D8VJYF; S3 Extended Request ID: t8D2C5IazpRmCbGGH0Hofx1FYf1KqHHZbgw5iSR/xWxw/bOt4f8iSbtszyGFP1PaPYss/Vi1XwdvMZX4wRIUrTk7RIZ5OUS3taz+jCMzLyk=; Proxy: null)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1372) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$4(S3AFileSystem.java:1307) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1304) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2264) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2226) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2160) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$getFileStatus$17(HoodieWrapperFileSystem.java:410) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:114) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.common.fs.HoodieWrapperFileSystem.getFileStatus(HoodieWrapperFileSystem.java:404) ~[utilities-0.1.0-beta1-bundled.jar:?]
    at org.apache.hudi.exception.TableNotFoundException.checkTableValidity(TableNotFoundException.java:51) ~[utilities-0.1.0-beta1-bundled.jar:?]
    ... 8 more
soumilshah@Soumils-MacBook-Pro StarRocks-Hudi-Minio % 

How do I configure OneTable to work with MinIO buckets? I tried both s3 and s3a paths, but neither worked. How should I set this up to work with MinIO buckets?
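One thing that can help narrow it down: a quick boto3 check along these lines (my own sketch, assuming boto3 is installed, reusing the admin/password credentials and the localhost:9000 endpoint from the compose file) verifies whether the bucket and the table's .hoodie folder are reachable from the host with path-style addressing, independently of the sync jar.

# Hypothetical sanity check, not part of the original report: list the Hudi
# metadata folder on MinIO using the same endpoint, credentials, and
# path-style addressing that the Spark job uses.
import boto3
from botocore.client import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="admin",
    aws_secret_access_key="password",
    config=Config(s3={"addressing_style": "path"}),
)

resp = s3.list_objects_v2(
    Bucket="huditest",
    Prefix="hudi/database=default/table_name=customers/.hoodie/",
    MaxKeys=5,
)
for obj in resp.get("Contents", []):
    print(obj["Key"])

If that listing works but the jar still returns 403, it would suggest the standalone utility is not picking up the MinIO endpoint and is sending the request elsewhere.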

alberttwong commented 9 months ago

related: https://github.com/onetable-io/onetable/issues/322

alberttwong commented 8 months ago

https://stackoverflow.com/questions/78044311/using-minio-how-to-authenticate-amazon-s3-endpoint-in-java

soumilshah1995 commented 8 months ago

@alberttwong did you get it to work with OneTable?

soumilshah1995 commented 8 months ago

Still the same issue @alberttwong

export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password
export S3_ENDPOINT=http://localhost:9000
export AWS_ENDPOINT_URL_S3=http://localhost:9000
export AWS_ENDPOINT=http://localhost:9000

java \
-jar  /Users/soumilshah/IdeaProjects/SparkProject/MyGIt/StarRocks-Hudi-Minio/jar/utilities-0.1.0-beta1-bundled.jar \
--datasetConfig ./config.yml
alberttwong commented 8 months ago

@sagarlakshmipathy is there something we're missing? I feel like the Java app doesn't pick up the environment variables.

alberttwong commented 8 months ago
[root@spark-hudi auxjars]# cat ~/.aws/config 
[default]
region = us-west-2
output = json

[services testing-s3]
s3 = 
  endpoint_url = http://minio:9000
[root@spark-hudi auxjars]# cat ~/.aws/credentials 
[default]
aws_access_key_id = admin
aws_secret_access_key = password
[root@spark-hudi auxjars]# aws s3 ls --endpoint http://minio:9000
2024-02-22 21:23:53 huditest
2024-02-22 21:18:19 warehouse
[root@spark-hudi auxjars]# env|grep AWS
AWS_IGNORE_CONFIGURED_ENDPOINT_URLS=true
AWS_REGION=us-east-1
AWS_ENDPOINT_URL_S3=http://minio:9000
alberttwong commented 8 months ago

Okay, I figured it out. You need to modify utilities/src/main/resources/onetable-hadoop-defaults.xml to include additional configs (see https://github.com/onetable-io/onetable/pull/337). We need to clean this up so that OneTable scans for conf files.

Also, we need to modify the Trino schema create command.

See https://github.com/StarRocks/demo/issues/54 for all the instructions.
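For reference, the extra entries are essentially the same fs.s3a.* settings the Spark job above sets programmatically. Here is a rough sketch (my own helper, not the exact contents of the PR; the endpoint and admin/password values assume the local MinIO from the compose file) that writes them out in Hadoop's XML property format:

# Hypothetical helper: emit the fs.s3a.* MinIO settings as a Hadoop-style
# configuration file. The authoritative entries are in the PR linked above.
import xml.etree.ElementTree as ET

s3a_props = {
    "fs.s3a.endpoint": "http://localhost:9000",
    "fs.s3a.access.key": "admin",
    "fs.s3a.secret.key": "password",
    "fs.s3a.path.style.access": "true",
    "fs.s3a.connection.ssl.enabled": "false",
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
}

configuration = ET.Element("configuration")
for name, value in s3a_props.items():
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value

ET.ElementTree(configuration).write("minio-s3a-site.xml", xml_declaration=True)

The generated properties can then be merged into onetable-hadoop-defaults.xml, or into whatever Hadoop conf file the utility ends up reading.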

soumilshah1995 commented 8 months ago

I'm lost lol, which variables do we need to set?

I tried

export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password
export S3_ENDPOINT=http://localhost:9000
export AWS_ENDPOINT_URL_S3=http://localhost:9000
export AWS_ENDPOINT=http://localhost:9000

Are these all the variables I need to set?

alberttwong commented 8 months ago

That's the issue... none of them worked, so I'm just listing all the variants I tried. What worked was only for my own demo, where I modified my Hadoop settings. I couldn't get it working for OneTable's Docker demo.

soumilshah1995 commented 8 months ago

Understood, let me post the ticket in the Hudi channel; maybe someone can help there.

nfarah86 commented 8 months ago

@the-other-tim-brown is this something you can help with here?

sagarlakshmipathy commented 8 months ago

@soumilshah1995 @alberttwong I'm on vacation until Thursday; I can help replicate this issue once I'm back. @the-other-tim-brown feel free to chime in when you get a chance.

soumilshah1995 commented 8 months ago

@sagarlakshmipathy Thanks a lot. Please enjoy your vacation; this is not urgent or blocking, it's mostly for a POC, so no hurry at all.

the-other-tim-brown commented 8 months ago

I have not used MinIO before. I will have to spend some time coming up to speed on it.

the-other-tim-brown commented 8 months ago

@soumilshah1995 have you tried updating the hadoop config like Albert suggested? He put up the configs he used here: https://github.com/apache/incubator-xtable/pull/337/files

soumilshah1995 commented 8 months ago

I haven't opted to use a container for my Hadoop setup. Could you kindly suggest the steps to set it up on a Mac?

alberttwong commented 7 months ago

Soumil... I think if you slightly change your java run command to java -jar utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig onetable.yaml -p ../conf/core-site.xml, it should work. The raw XML can be found at https://github.com/StarRocks/demo/blob/master/documentation-samples/datalakehouse/conf/core-site.xml

soumilshah1995 commented 7 months ago

I think I will take a different route instead, using DeltaStreamer and XTable with MinIO and StarRocks. I need to try that; I will try it and keep you all posted here.

soumilshah1995 commented 7 months ago

I have decided to go this route instead (screenshot attached).

soumilshah1995 commented 7 months ago

Here is my architecture (image attached).

alberttwong commented 5 months ago

https://atwong.medium.com/what-you-need-to-have-spark-read-and-write-in-s3-specifically-apache-iceberg-apache-hudi-delta-c3a976adb603