I am pretty sure my problem is related to trying to use a hive 3.x patch on Spark 3.1.x/Hive 2.7.x build.
@berglh How were you able to resolve the issue?
@yongkyunlee I ended up downloading the EMR docker container from the AWS ECR registry. The main reason was that the EMR build is designed to run Apache Spark and interact with AWS services - their jar builds will work with the equivalent versions of Spark. I then ended up using the AWS-built Apache Spark artefacts by copying them from the EMR image into a new container that we launch in Amazon EKS via Airflow/Kubeflow notebook servers.
Just note that we move everything back to the stock Apache Spark default locations rather than the locations AWS uses in the EMR container. We also spent quite a lot of time merging the default configuration from the AWS container with the configuration from our earlier Apache Spark builds, which used a manually built jar. Most of this is explained in the README of this repo, but there are some other critical configurations in the container worth observing.
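To give a rough idea of the approach, here is a minimal sketch of the copy step. The base image, stage names and the /usr/lib/spark and /usr/share/aws source paths are assumptions for illustration - verify them against the image you actually pull:

```dockerfile
# Sketch only: copy the AWS-built Spark from the EMR on EKS image into stock locations.
# Image tag, base image and source paths are assumptions - check them in the pulled image.
FROM 038297999601.dkr.ecr.ap-southeast-2.amazonaws.com/spark/emr-6.12.0:latest AS emr

FROM eclipse-temurin:8-jre AS runtime
ENV SPARK_HOME=/opt/spark
# Move the EMR-built Spark distribution (jars, binaries, Python bindings) to the stock location.
COPY --from=emr /usr/lib/spark/ ${SPARK_HOME}/
# Bring across the AWS SDK / Glue Data Catalog client jars that EMR ships alongside Spark.
COPY --from=emr /usr/share/aws/ /usr/share/aws/
ENV PATH=${SPARK_HOME}/bin:${PATH}
```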
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6150-release.html
To grab the container, log in to AWS ECR with your AWS account STS tokens:
```bash
aws ecr get-login-password --region ap-southeast-2 | sudo docker login --username AWS --password-stdin 038297999601.dkr.ecr.ap-southeast-2.amazonaws.com
sudo docker pull 038297999601.dkr.ecr.ap-southeast-2.amazonaws.com/spark/emr-${EMR_VERSION}:latest
```
We are currently running:
```bash
######
## EMR & Spark Related Versions
# The following versions need to match those used in the appropriate EMR release:
# https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-app-versions-6.x.html
EMR_VERSION=6.12.0
SPARK_VERSION=3.4.0
HADOOP_VERSION=3.3.3
HIVE_VERSION=3.1.3
SCALA_VERSION=2.12
AWS_JAVA_SDK_VERSION=1.12.490
######
## Build: versions used specifically in the jar building container (build), not in the final container
MAVEN_VERSION=3.8.8
PYTHON_VERSION=3.9
```
I'll attach some of the Dockerfile we used to build it; please note that I excluded the build steps for some custom plugins we use. We run the build using Docker Compose, which passes the versions for each package as build arguments. The main thing is to match all the versions with what's in the EMR container. I tried to build Spark 3.4.1 with this approach, but it isn't working and I haven't gotten around to sorting it out yet; hopefully you're familiar enough with Spark in general and Docker to figure it out :) You should see where I copy all the jars from in the container, which should let you retrieve the built artefacts and the config they are using.
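As a rough illustration of the Docker Compose side (the service name, build context and image tag here are made up, not our actual file), passing those version pins through as build arguments looks something like this:

```yaml
# Sketch of wiring the version pins above into the image build as build args.
# Service name, build context and image tag are placeholders for illustration.
services:
  spark:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        EMR_VERSION: "6.12.0"
        SPARK_VERSION: "3.4.0"
        HADOOP_VERSION: "3.3.3"
        HIVE_VERSION: "3.1.3"
        SCALA_VERSION: "2.12"
        AWS_JAVA_SDK_VERSION: "1.12.490"
    image: example-registry/spark-emr:6.12.0
```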
@yongkyunlee Just a heads up if you go down this path: Spark 3.4 has moved to log4j2 logging, so if you want the PySpark INFO level logs, the configuration in the log4j2 properties is:
```properties
# Set the PySpark default logging level; if it's not the same level as the rootLogger,
# a warning message is printed on PySpark REPL shell startup - keep this the same
logger.python.name = org.apache.spark.api.python.PythonGatewayServer
logger.python.level = info
```
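For context, a trimmed sketch of where those lines sit in $SPARK_HOME/conf/log4j2.properties, based on the stock Spark 3.4 template (appender definitions omitted; start from conf/log4j2.properties.template in the Spark distribution):

```properties
# Trimmed sketch of a Spark 3.4-style log4j2.properties (appender definitions omitted).
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console

# Keep the PythonGatewayServer logger at the same level as the rootLogger to avoid
# the REPL startup warning mentioned above.
logger.python.name = org.apache.spark.api.python.PythonGatewayServer
logger.python.level = info
```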
This took me a lot of digging through the source code and manual builds of Apache Spark to figure out, as it wasn't included in the default properties or documented. Maybe this has changed in newer versions.
@berglh Thanks a lot for the detailed explanations! Really appreciate it. I'll let you know if I have any follow-up questions.
Just to share some updates: I was able to get it working by first building Hive 2.3 and then Hive 3.1 per the README instructions (I had to exclude two dependencies to make the Hive 3.1 build work). When I tried building the Glue Data Catalog Spark/Hive client after that, the shim dependencies were picked up. Then, when actually executing Spark (thanks for the hints from the Dockerfile you attached), I had to add a lot of Hive 2.3 jars as dependencies and use the spark.sql.hive.metastore.jars config to tell Spark to use the patched Hive 2.3 for the metastore.
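A minimal sketch of what that configuration can look like, assuming the patched Hive 2.3 jars were copied to a hypothetical /opt/hive-2.3-jars directory (the path and the exact 2.3.x version are illustrative, not taken from the comment above):

```properties
# spark-defaults.conf sketch: point Spark at the patched Hive 2.3 jars for the metastore.
# /opt/hive-2.3-jars is a hypothetical location; match the version to the Hive 2.3.x you built.
spark.sql.catalogImplementation                   hive
spark.sql.hive.metastore.version                  2.3.9
spark.sql.hive.metastore.jars                     path
spark.sql.hive.metastore.jars.path                /opt/hive-2.3-jars/*.jar
spark.hadoop.hive.metastore.client.factory.class  com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```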
@yongkyunlee I also did get this building for Spark 3.3, but I've since lost my Dockerfile for it, and while I was able to list the metastore catalog, I was definitely hitting issues with full functionality when loading tables - I think your trick of including the Hive 2.3 jars was probably the missing piece.
I then asked myself why bother building it at all when it's already bundled in the EMR container; basing the image on a known working build was a much faster path, and I really wanted to ensure I was getting all the advantages of the latest Hadoop version for the S3A object store committer optimisations. It was the easier solution in the end, but I'm glad you got it working - just ensure all the functionality you require works as expected. Kudos! :+1:
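For reference, a sketch of the committer settings I mean, following the Hadoop S3A committer and Spark cloud integration docs (these are the standard property names; the values and whether you use the magic or staging committer depend on your setup):

```properties
# Sketch: enable the S3A "magic" committer via spark-defaults.conf.
# Requires the spark-hadoop-cloud module and a Hadoop 3.x S3A client on the classpath.
spark.hadoop.fs.s3a.committer.name             magic
spark.hadoop.fs.s3a.committer.magic.enabled    true
spark.sql.sources.commitProtocolClass          org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class       org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```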
@pmiten I had been following #32 where @fnapolitano73 suggested trying the `branch-3.4.0` branch for Spark 3.1 (Hive 3.1.3) support. I am wanting to build this client for use in an EKS pod to match the AWS guide here: https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/, which doesn't mention anything about accessing Glue via the Hive Metastore. We are currently using Spark 3.0.1-SNAPSHOT and an old version of Hive, which doesn't give us access to the S3A improvements in the newer Hadoop 3 versions.
I followed the README.md on `branch-3.4.0`: git cloned Hive, applied the patch in this repo, and built Hive 3.1.3 as per the instructions just fine using openjdk-8 and Maven 3.6.3. I then moved into the directory of this repo with `branch-3.4.0` checked out and ensured the pom.xml referenced the 3.1.3 hive3 version.
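Roughly the steps I mean, as a sketch (the patch file path and Hive tag are placeholders - follow the README for the exact names):

```bash
# Sketch of the Hive patch/build steps from the README; patch path and tag are placeholders.
git clone https://github.com/apache/hive.git
cd hive
git checkout rel/release-3.1.3
git apply /path/to/aws-glue-data-catalog-client-for-apache-hive-metastore/*.patch
mvn clean install -DskipTests
```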
The instructions say to change into the `aws-glue-datacatalog-hive2-client` folder; however, that doesn't exist in this branch. So I just went ahead and tried to build the client in the `aws-glue-datacatalog-hive3-client` folder instead, and I receive this error:

If I try to just build the Spark client and not the Hive client as well:
I also tried to run `mvn clean package -DskipTests` at the project top directory to build all the components, and that failed with these errors after building some of the components OK:

In any event, I'm just not able to get this branch to build at all. I feel like there is some basic configuration missing to get this working.