apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Docs] A java example: how to connect s3 storage. #1144

Open leaves12138 opened 1 year ago

leaves12138 commented 1 year ago

Motivation

The docs on S3 storage are vague. We need an example that explains how to connect to S3 storage. (We can use MinIO.)

Solution

No response

Anything else?

No response

CodingGPT commented 1 year ago

Please assign this to me, thanks. I want to try it.

andreyolv commented 4 months ago

Any news?

A complete and clear example would be great. I tried the one below and it doesn't work. Does anyone know why?

from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql import Window

# https://paimon.apache.org/docs/master/spark/quick-start/#preparation

spark = (
    SparkSession.builder
    .appName("Paimon")
    # Pull the Paimon Spark runtime and the Paimon S3 filesystem jars from Maven
    .config("spark.jars.packages",
            "org.apache.paimon:paimon-spark-3.3:0.7.0-incubating,"
            "org.apache.paimon:paimon-s3:0.7.0-incubating")
    # S3 / MinIO: spark.hadoop.* options are copied into the Hadoop configuration
    .config("spark.hadoop.fs.s3a.access.key", "XXXXXXXXX")
    .config("spark.hadoop.fs.s3a.secret.key", "XXXXXXXXX")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.minio:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", True)
    .config("spark.hadoop.fs.s3a.fast.upload", True)
    .config("spark.hadoop.fs.s3a.multipart.size", 104857600)
    # Needs the spark.hadoop. prefix; a bare fs.s3a.* key is ignored by Spark
    .config("spark.hadoop.fs.s3a.connection.maximum", 100)
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    # Paimon catalog named "paimon", with its warehouse on S3
    .config("spark.sql.catalog.paimon", "org.apache.paimon.spark.SparkCatalog")
    .config("spark.sql.catalog.paimon.s3.access-key", "XXXXXXXX")
    .config("spark.sql.catalog.paimon.s3.secret-key", "XXXXXXXXX")
    .config("spark.sql.catalog.paimon.s3.endpoint", "http://minio.minio:9000")
    .config("spark.sql.catalog.paimon.warehouse", "s3://lakehouse/paimon")
    .config("spark.sql.extensions", "org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions")
    .getOrCreate()
)

Spark version: 3.3.0
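
One possible cause, offered as a guess since no error message is given: the Paimon catalog options above set the S3 endpoint and keys but not path-style access, which a standalone MinIO deployment usually needs. The sketch below collects the catalog options in one place and adds `s3.path.style.access`, the option name I recall from the Paimon S3 filesystem docs; please verify it against the docs for your version. The helper names `paimon_s3_options` and `build_session` are hypothetical and exist only for this illustration. The endpoint, bucket, and placeholder keys are copied from the snippet above.

```python
def paimon_s3_options(endpoint, access_key, secret_key, warehouse, catalog="paimon"):
    """Hypothetical helper: Spark options for a Paimon catalog on an S3-compatible store."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Same Maven coordinates as in the snippet above
        "spark.jars.packages": ",".join([
            "org.apache.paimon:paimon-spark-3.3:0.7.0-incubating",
            "org.apache.paimon:paimon-s3:0.7.0-incubating",
        ]),
        "spark.sql.extensions": "org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions",
        prefix: "org.apache.paimon.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.s3.endpoint": endpoint,
        f"{prefix}.s3.access-key": access_key,
        f"{prefix}.s3.secret-key": secret_key,
        # Assumption: MinIO is addressed as http://host:9000/bucket, so path-style
        # access has to be enabled for the Paimon catalog itself
        f"{prefix}.s3.path.style.access": "true",
    }


def build_session(options, app_name="Paimon"):
    """Hypothetical helper: apply the options to a SparkSession builder."""
    # Imported lazily so paimon_s3_options can be inspected without PySpark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


options = paimon_s3_options(
    endpoint="http://minio.minio:9000",
    access_key="XXXXXXXXX",
    secret_key="XXXXXXXXX",
    warehouse="s3://lakehouse/paimon",
)
```

Calling `build_session(options)` would then create the session. Keeping the options in a plain dict makes it easy to print them and compare against what Spark reports in `spark.sparkContext.getConf().getAll()` when debugging.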