Spark connection to Yandex Cloud Object Storage

K0nkere opened 1 year ago

Download jars from Maven Repository

Visit the Maven Repository pages for the two jars `hadoop-aws-<version>` and `aws-java-sdk-bundle-<version>` (listed under Compile Dependencies). Create a folder `lib` and download both jars into it:
```bash
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
```
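As an alternative (a sketch, not part of the original walkthrough): Spark can resolve the same jars from Maven Central at session start via `spark.jars.packages`, so the manual download step can be skipped. The coordinates below simply mirror the versions above; in general the `hadoop-aws` version should match the Hadoop build bundled with your PySpark.

```python
from pyspark.conf import SparkConf

# Let Spark pull the connector jars from Maven Central at startup.
# Versions are examples and should match your PySpark/Hadoop build.
conf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("test") \
    .set("spark.jars.packages",
         "org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262")
```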
Creating Spark Session and conf
```python
import os

import pyspark
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

# Point Spark at the downloaded connector jars (comma-separated, no spaces).
conf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("test") \
    .set("spark.jars",
         "/home/winx/de-zoomcamp/lib/aws-java-sdk-bundle-1.12.262.jar,"
         "/home/winx/de-zoomcamp/lib/hadoop-aws-3.3.4.jar")

# Credentials are taken from environment variables.
AWS_ACCESS_KEY_ID = os.getenv("aws_access_key_id")
AWS_SECRET_ACCESS_KEY = os.getenv("aws_secret_access_key")

sc = SparkContext(conf=conf)

# Configure the S3A filesystem for the Yandex Cloud Object Storage endpoint.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.signing-algorithm", "")
hadoop_conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.endpoint", "storage.yandexcloud.net")
hadoop_conf.set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
hadoop_conf.set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)

spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()
```
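An equivalent sketch (not the original snippet): the same S3A settings can be passed as `spark.hadoop.*` keys on the `SparkSession` builder, which avoids touching the private `sc._jsc` handle. Jar paths and environment variable names are carried over from above.

```python
import os
from pyspark.sql import SparkSession

# Same S3A settings as above, expressed as spark.hadoop.* builder options.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .config("spark.jars",
            "/home/winx/de-zoomcamp/lib/aws-java-sdk-bundle-1.12.262.jar,"
            "/home/winx/de-zoomcamp/lib/hadoop-aws-3.3.4.jar") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.endpoint", "storage.yandexcloud.net") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.access.key", os.getenv("aws_access_key_id")) \
    .config("spark.hadoop.fs.s3a.secret.key", os.getenv("aws_secret_access_key")) \
    .getOrCreate()
```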
Reading

Attention: read with the `s3a://` prefix.
```python
df = spark.read.parquet("s3a://kkr-de-zoomcamp/ny-taxi-data/green_tripdata_2019-07.parquet")
```
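A quick sanity check that the read actually hit the bucket:

```python
# Inspect the schema and row count of the DataFrame loaded from Object Storage.
df.printSchema()
print(df.count())
```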
Writing
```python
df.write.format("csv") \
    .option("header", "True") \
    .save("s3a://<your_bucket_name_here>/<your_folder_here>", mode="overwrite")
```
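To confirm the write, one option (a sketch, using the same placeholder bucket and folder names) is to read the CSV back with the `s3a://` prefix:

```python
# Read back the CSV that was just written; bucket and folder are placeholders.
df_check = spark.read \
    .option("header", "true") \
    .csv("s3a://<your_bucket_name_here>/<your_folder_here>")
df_check.show(5)
```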
(Optional way)

After downloading the jars we can copy them into the `SPARK_HOME/jars` folder, so there is no need to set `spark.jars` in the SparkConf (see the sketch below).
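A minimal sketch of that copy step (assuming `SPARK_HOME` is set in the environment and the jars were downloaded into `./lib`):

```python
import os
import shutil
from pathlib import Path

# Copy the downloaded connector jars into $SPARK_HOME/jars so Spark
# picks them up without spark.jars being set.
spark_jars = Path(os.environ["SPARK_HOME"]) / "jars"
for jar in Path("lib").glob("*.jar"):
    shutil.copy(jar, spark_jars)
```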
See also: S3A committer settings (Настройки комиттеров S3A) in the Yandex Cloud documentation.