Closed archenroot closed 2 years ago
To get Glue working with MinIO S3 storage, I had to do the following:
```dockerfile
FROM amazon/aws-glue-libs:glue_libs_3.0.0_image_01

USER root
RUN echo 'root:root' | chpasswd
RUN yum install -y ping telnet sudo

ADD ./aws-glue/config/spark/core-site.xml /home/glue_user/spark/conf
RUN chown glue_user:root /home/glue_user/spark/conf/core-site.xml

RUN mkdir /home/glue_user/.aws
ADD .aws /home/glue_user/.aws
RUN chown glue_user:root -R /home/glue_user/.aws

WORKDIR /home/glue_user
USER glue_user
CMD ["./jupyter/jupyter_start.sh"]
```
I added custom core-site.xml content. Note that my S3 host is nginx, because I deploy a 2-4 node MinIO cluster for local development with nginx as the communication gateway:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
```
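The `<configuration>` body of that file did not survive the paste above. As a sketch, the s3a settings such a core-site.xml typically carries can equally be expressed from Python as `spark.hadoop.*` keys; the endpoint and credentials below are illustrative placeholders, not my actual values:

```python
# Placeholder endpoint: nginx fronting the MinIO cluster, as described above.
MINIO_ENDPOINT = "http://nginx:9000"

def s3a_conf_for_minio(endpoint, access_key, secret_key):
    """Hadoop s3a settings (the kind core-site.xml carries), expressed
    as Spark config keys via the spark.hadoop.* prefix."""
    return {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO buckets are addressed by path, not by virtual-host name
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
    }

conf = s3a_conf_for_minio(MINIO_ENDPOINT, "minioadmin", "minioadmin")
print(conf["spark.hadoop.fs.s3a.endpoint"])  # → http://nginx:9000
```

The same dict can be fed to `SparkSession.builder.config(...)` pair by pair instead of shipping an XML file into the image.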
And spark-defaults.conf:
```
spark.driver.extraClassPath /home/glue_user/spark/jars/:/home/glue_user/aws-glue-libs/jars/
spark.executor.extraClassPath /home/glue_user/spark/jars/:/home/glue_user/aws-glue-libs/jars/
spark.master local
spark.sql.catalogImplementation hive
spark.eventLog.enabled true
spark.history.fs.logDirectory file:////tmp/spark-events
spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs false
spark.unsafe.sorter.spill.read.ahead.enabled false
spark.network.crypto.enabled true
spark.network.crypto.keyLength 256
spark.network.crypto.keyFactoryAlgorithm PBKDF2WithHmacSHA256
spark.network.crypto.saslFallback false
spark.authenticate true
spark.io.encryption.enabled true
spark.io.encryption.keySizeBits 256
spark.io.encryption.keygen.algorithm HmacSHA256
spark.authenticate.secret 0800ffef-37e7-4f73-abdc-73f2df7c58f2
```
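The file follows Spark's simple layout: one property per line, key and value separated by whitespace. A small sketch of reading it back (the `parse_spark_defaults` helper is my own, not part of Spark):

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf content: one property per line,
    key and value separated by whitespace; '#' lines are comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        conf[key] = value.strip()
    return conf

sample = """
spark.master local
spark.sql.catalogImplementation hive
"""
print(parse_spark_defaults(sample))
# → {'spark.master': 'local', 'spark.sql.catalogImplementation': 'hive'}
```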
Still, I would welcome having the Dockerfile for the base Glue v3 image.
For anyone interested, an unofficial Dockerfile: https://github.com/alrouen/local-aws-glue-v3-zeppelin/blob/main/docker/Dockerfile
I know the topic here is getting the official Glue image, but per my requirements I had to integrate Glue with the MinIO and Airflow images. I did the MinIO part as above, and I will try to share the whole docker-compose once it all works.
For the moment I found this Airflow Docker image: https://github.com/aws/aws-mwaa-local-runner/issues/71, but I don't know whether I should use GlueOperator or SparkOperator to trigger the job in a local run. In general, I can add logic at the PySpark job level to detect whether it is running in AWS or in a Docker container, and create the session accordingly. This still requires some additional customization of the Glue image with respect to the spark-defaults.conf file, e.g. changing spark.master local to spark.master spark://spark-master:7077.
To achieve this I will need to either spin up a master and a single worker inside the Glue image directly, or take the AWS Glue Spark distribution out of the image, build a docker-compose cluster from it with one or more nodes (depending on configuration), and expose the master on hostname spark-master:7077, which the docker-compose networking subsystem supports.
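The spark.master switch described above could also be made at session-build time instead of baking it into the image. A minimal sketch, assuming a `DOCKER_SPARK_MASTER` variable set in docker-compose (that variable name is my own convention, not a Spark one):

```python
# Hypothetical convention: DOCKER_SPARK_MASTER is set in docker-compose,
# and left unset on AWS Glue, where Glue manages the master itself.
def choose_spark_master(environ):
    """Return the spark.master URL for the current environment,
    or None when running on Glue (don't override the managed master)."""
    if environ.get("GLUE_VERSION"):
        return None
    return environ.get("DOCKER_SPARK_MASTER", "local")

print(choose_spark_master({"GLUE_VERSION": "3.0"}))  # → None
print(choose_spark_master({"DOCKER_SPARK_MASTER": "spark://spark-master:7077"}))
# → spark://spark-master:7077
print(choose_spark_master({}))  # → local
```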
Now, running the script below in AWS to see whether an environment variable can be used for dynamic context initialization:
```python
import os

# printing environment variables
print(os.environ)
```
I get:
```
environ({'GLUE_VERSION': '3.0',
'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin',
'PYTHONPATH': '/opt/amazon/spark/jars/spark-core_2.12-3.1.1-amzn-0.jar:/opt/amazon/spark/python/lib/pyspark.zip:/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip:/opt/amazon/lib/python3.6/site-packages',
'USE_PROXY': 'false',
'AWS_DEFAULT_REGION': 'us-east-1',
'GLUE_TASK_GROUP_ID': '3623ed70-54c9-4202-ac25-43ef28a0f318',
'AWS_METADATA_SERVICE_NUM_ATTEMPTS': '50',
'GLUE_COMMAND_CRITERIA': 'glueetl',
'PYSPARK_GATEWAY_SECRET': '6d808e6b645edfd77f1d2dd32196082010707026418fdb1405473085bb73ea49',
'LANG': 'en_US.UTF-8',
'ERROR_FILE_NAME_LOCATION': '/reporting/error.txt',
'SPARK_CONF_DIR': '/opt/amazon/conf',
'OMP_NUM_THREADS': '4',
'PYSPARK_GATEWAY_PORT': '45811',
'PYSPARK_PYTHON': '/usr/bin/python3',
'PYTHONUNBUFFERED': 'YES',
'HOSTNAME': 'ip-172-34-132-150.ec2.internal',
'LD_LIBRARY_PATH': '/opt/amazon/lib/hadoop-lzo-native:/opt/amazon/lib/hadoop-native/:/opt/amazon/lib/glue-native',
'WORKER_TYPE': 'G.1X',
'PWD': '/tmp',
'HOME': '/home/spark',
'SHLVL': '0',
'GLUE_PYTHON_VERSION': '3',
'CONTAINER_HOST_PRIVATE_IP': '172.34.132.150'})
```
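GLUE_VERSION (or GLUE_COMMAND_CRITERIA) looks like a reliable marker in that dump. A tiny helper I could reuse across jobs (the function name is my own):

```python
def running_on_glue(environ):
    """Detect the AWS Glue runtime by the variables observed
    in the environment dump above."""
    return "GLUE_VERSION" in environ or "GLUE_COMMAND_CRITERIA" in environ

# The dump above would be detected as Glue:
print(running_on_glue({"GLUE_VERSION": "3.0", "WORKER_TYPE": "G.1X"}))  # → True
# A plain Docker container without those variables would not:
print(running_on_glue({"PATH": "/usr/bin"}))  # → False
```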
So, putting this together, I will define my jobs like the snippet below. As long as I don't use any Glue extensions and stick with the pure Spark library, I should be fine even if I later move to SparkOperator on EKS, for example, or to EMR clusters:
```python
import os

# GLUE_VERSION is only set in the AWS Glue runtime (None elsewhere)
GLUE_VERSION = os.getenv('GLUE_VERSION')

sc = None
spark = None

if GLUE_VERSION is not None:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    # ... job transformations go here ...
    job.commit()
else:
    from pyspark.sql import SparkSession  # import missing in my first draft

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    sc = spark.sparkContext
```
In a similar way I will define the Airflow DAGs: if on AWS, use GlueOperator; if on Docker, use SparkOperator.
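A sketch of that DAG-level switch, keeping it as plain Python for clarity. The `AIRFLOW_ENV` variable is my own convention, and the returned strings are just labels for the operator classes from the respective Airflow provider packages:

```python
# Hypothetical convention: AIRFLOW_ENV=docker is set in docker-compose;
# on AWS it is left unset (or set to "aws").
def pick_operator(environ):
    """Return the name of the operator class the DAG should instantiate."""
    if environ.get("AIRFLOW_ENV") == "docker":
        return "SparkSubmitOperator"  # submits to the local spark-master
    return "GlueJobOperator"          # triggers the managed Glue job

print(pick_operator({"AIRFLOW_ENV": "docker"}))  # → SparkSubmitOperator
print(pick_operator({}))                         # → GlueJobOperator
```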
This would be great for Apple hardware.
The Docker image for Glue 3.0 is officially available. Here's the blog post about it: https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/
Hello, I already requested in another issue that this artifact be made publicly available, but I am requesting it again. It will help the community when building local environments where other systems need to be integrated:
Thank you. You can even copy-paste the Dockerfile content here into the issue; that is fine.
Thanks.