awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Release Official Glue v3 Dockerfile #113

Closed archenroot closed 2 years ago

archenroot commented 2 years ago

Hello, I already requested in another issue that this artifact be made publicly available, but I am requesting it again. It will help the community when building local environments where I need to integrate some other systems.

Thank you. You can even copy-paste the Dockerfile content here into the issue; that is fine.

Thanks.

archenroot commented 2 years ago

So to get Glue working with MinIO S3 storage I had to do the following:

1. Customized the existing Glue v3 image (the .aws credentials are copied into /home/glue_user/.aws so they end up under the glue_user home):

    FROM amazon/aws-glue-libs:glue_libs_3.0.0_image_01
    USER root
    RUN echo 'root:root' | chpasswd
    RUN yum install -y ping telnet sudo
    ADD ./aws-glue/config/spark/core-site.xml /home/glue_user/spark/conf
    RUN chown glue_user:root /home/glue_user/spark/conf/core-site.xml
    RUN mkdir /home/glue_user/.aws
    ADD .aws /home/glue_user/.aws
    RUN chown glue_user:root -R /home/glue_user/.aws
    WORKDIR /home/glue_user
    USER glue_user
    CMD ["./jupyter/jupyter_start.sh"]


2. Added custom core-site.xml content. Please note that my S3 host is nginx, since I deploy a 2-4 node MinIO cluster for local development with nginx as the communication gateway:

<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

fs.s3a.endpoint http://nginx:9000 fs.s3a.connection.maximum 500 Controls the maximum number of simultaneous connections to S3. fs.s3a.connection.ssl.enabled false fs.s3a.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider, com.amazonaws.auth.profile.ProfileCredentialsProvider, com.amazonaws.auth.DefaultAWSCredentialsProviderChain fs.s3.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider, com.amazonaws.auth.profile.ProfileCredentialsProvider, com.amazonaws.auth.DefaultAWSCredentialsProviderChain fs.s3a.path.style.access true fs.s3a.fast.upload true fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem

3. Added spark-defaults.conf:

    spark.driver.extraClassPath /home/glue_user/spark/jars/*:/home/glue_user/aws-glue-libs/jars/*
    spark.executor.extraClassPath /home/glue_user/spark/jars/*:/home/glue_user/aws-glue-libs/jars/*
    spark.master local
    spark.sql.catalogImplementation hive
    spark.eventLog.enabled true
    spark.history.fs.logDirectory file:////tmp/spark-events
    spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs false
    spark.unsafe.sorter.spill.read.ahead.enabled false
    spark.network.crypto.enabled true
    spark.network.crypto.keyLength 256
    spark.network.crypto.keyFactoryAlgorithm PBKDF2WithHmacSHA256
    spark.network.crypto.saslFallback false
    spark.authenticate true
    spark.io.encryption.enabled true
    spark.io.encryption.keySizeBits 256
    spark.io.encryption.keygen.algorithm HmacSHA256
    spark.authenticate.secret 0800ffef-37e7-4f73-abdc-73f2df7c58f2
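
To sanity-check this wiring from inside the container, I use a minimal PySpark snippet along these lines. The bucket name test-bucket is just a placeholder from my setup, and the explicit .config() calls simply mirror core-site.xml in case the XML is not picked up:

from pyspark.sql import SparkSession

# Sketch: verify that S3A actually reaches MinIO through nginx.
# "test-bucket" is a placeholder; the endpoint and credentials must match core-site.xml.
spark = (
    SparkSession.builder
    .appName("minio-s3a-smoke-test")
    .config("spark.hadoop.fs.s3a.endpoint", "http://nginx:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

# Write a tiny DataFrame to MinIO and read it back.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("s3a://test-bucket/smoke-test/")
print(spark.read.parquet("s3a://test-bucket/smoke-test/").count())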


Still, I would welcome having the base Glue v3 image Dockerfile.

archenroot commented 2 years ago

For anyone interested in an unofficial Dockerfile: https://github.com/alrouen/local-aws-glue-v3-zeppelin/blob/main/docker/Dockerfile

archenroot commented 2 years ago

I know the topic here is getting the official Glue image, but per my requirements I had to integrate Glue with the MinIO and Airflow images. I did it as above with MinIO; I will try to share the whole docker-compose setup once it all works.

For the moment I found this Airflow Docker image (https://github.com/aws/aws-mwaa-local-runner/issues/71), but I don't know whether I should use GlueOperator or SparkOperator to trigger the job in a local run. In general, I can add some logic at the PySpark job level to detect whether it is running in AWS or in a Docker container and create the session accordingly. Still, this requires some additional customization of the Glue image with respect to the spark-defaults.conf file, e.g. changing spark.master local to spark.master spark://spark-master:7077.

To achieve this I will need to either spin up a master and a single worker inside the Glue image directly, or take the AWS Glue Spark distribution out, build a docker-compose cluster from it with a single node or multiple nodes (based on configuration), and expose the master on port 7077 under a hostname, which the docker-compose networking subsystem supports.
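
For the second option, a rough sketch of what the job side would look like when pointing at such a cluster; spark-master and port 7077 are assumptions based on a docker-compose service name, and SPARK_MASTER_URL is a hypothetical environment variable I would set in the compose file:

import os
from pyspark.sql import SparkSession

# Sketch: connect to an external standalone master instead of local mode.
# SPARK_MASTER_URL is a hypothetical variable set via docker-compose;
# it overrides spark.master from spark-defaults.conf.
master_url = os.getenv("SPARK_MASTER_URL", "spark://spark-master:7077")

spark = (
    SparkSession.builder
    .master(master_url)
    .appName("glue-local-on-standalone-cluster")
    .getOrCreate()
)
print(spark.sparkContext.master)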

Now, when running the script below in AWS to see whether an environment variable can be used for dynamic context initialization:

import os

# printing environment variables
print(os.environ)

I get:

environ({'GLUE_VERSION': '3.0', 
         'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin', 
         'PYTHONPATH': '/opt/amazon/spark/jars/spark-core_2.12-3.1.1-amzn-0.jar:/opt/amazon/spark/python/lib/pyspark.zip:/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip:/opt/amazon/lib/python3.6/site-packages', 
         'USE_PROXY': 'false', 
         'AWS_DEFAULT_REGION': 'us-east-1', 
         'GLUE_TASK_GROUP_ID': '3623ed70-54c9-4202-ac25-43ef28a0f318', 
         'AWS_METADATA_SERVICE_NUM_ATTEMPTS': '50', 
         'GLUE_COMMAND_CRITERIA': 'glueetl', 
         'PYSPARK_GATEWAY_SECRET': '6d808e6b645edfd77f1d2dd32196082010707026418fdb1405473085bb73ea49', 
         'LANG': 'en_US.UTF-8', 
         'ERROR_FILE_NAME_LOCATION': '/reporting/error.txt',
         'SPARK_CONF_DIR': '/opt/amazon/conf',
         'OMP_NUM_THREADS': '4', 
         'PYSPARK_GATEWAY_PORT': '45811', 
         'PYSPARK_PYTHON': '/usr/bin/python3', 
         'PYTHONUNBUFFERED': 'YES', 'HOSTNAME': 'ip-172-34-132-150.ec2.internal', 
         'LD_LIBRARY_PATH': '/opt/amazon/lib/hadoop-lzo-native:/opt/amazon/lib/hadoop-native/:/opt/amazon/lib/glue-native', 
         'WORKER_TYPE': 'G.1X', 'PWD': '/tmp', 
         'HOME': '/home/spark',
         'SHLVL': '0',
         'GLUE_PYTHON_VERSION': '3', 
         'CONTAINER_HOST_PRIVATE_IP': '172.34.132.150'})

So to put this together, I will define my jobs like this. As long as I don't use any Glue extensions and stick with the pure Spark library, I should be fine even if I later move to SparkOperator on EKS, for example, or to EMR clusters:

import os

# GLUE_VERSION is set in the AWS Glue runtime; it is not defined locally.
GLUE_VERSION = os.getenv('GLUE_VERSION')  # None outside of AWS Glue

sc = None
spark = None

if GLUE_VERSION is not None:
    # Running on AWS Glue: build the session through GlueContext.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    job.commit()
else:
    # Running locally (e.g. in a Docker container): build a plain SparkSession.
    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    sc = spark.sparkContext

In a similar way I will define the Airflow DAGs: if on AWS, use GlueOperator; if in Docker, use SparkOperator.
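
A rough sketch of what I mean, with the operator chosen at DAG parse time. Import paths and operator names depend on the installed provider packages (I'm assuming apache-airflow-providers-amazon and apache-airflow-providers-apache-spark here), and the RUNNING_ON_AWS variable, job name, script paths, and connection id are all placeholders:

import os
from datetime import datetime
from airflow import DAG

# Placeholder flag; in practice any reliable environment marker would do.
RUNNING_ON_AWS = os.getenv("RUNNING_ON_AWS", "false").lower() == "true"

with DAG(
    dag_id="glue_or_spark_job",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    if RUNNING_ON_AWS:
        from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

        run_job = GlueJobOperator(
            task_id="run_job",
            job_name="my-glue-job",                       # existing Glue job (placeholder)
            script_location="s3://my-bucket/scripts/job.py",
        )
    else:
        from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

        run_job = SparkSubmitOperator(
            task_id="run_job",
            application="/opt/airflow/dags/job.py",       # same script, local path
            conn_id="spark_default",                      # points at spark://spark-master:7077
        )

Since the same job script handles both session types (as above), only the operator needs to change between environments.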

gardner commented 2 years ago

This would be great for Apple hardware.

moomindani commented 2 years ago

The Docker image for Glue 3.0 is now officially available. Here's the blog post about it: https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/