awslabs / aws-glue-data-catalog-client-for-apache-hive-metastore

The AWS Glue Data Catalog is a fully managed, Apache Hive Metastore compatible, metadata repository. Customers can use the Data Catalog as a central repository to store structural and operational metadata for their data. AWS Glue provides out-of-box integration with Amazon EMR that enables customers to use the AWS Glue Data Catalog as an external Hive Metastore. This is an open-source implementation of the Apache Hive Metastore client on Amazon EMR clusters that uses the AWS Glue Data Catalog as an external Hive Metastore. It serves as a reference implementation for building a Hive Metastore-compatible client that connects to the AWS Glue Data Catalog. It may be ported to other Hive Metastore-compatible platforms such as other Hadoop and Apache Spark distributions.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
Apache License 2.0

Building branch-3.4.0 results in dependency errors #60

Closed: berglh closed this issue 2 years ago

berglh commented 2 years ago

@pmiten I had been following #32, where @fnapolitano73 suggested trying the branch-3.4.0 branch for Spark 3.1 (Hive 3.1.3) support.

Please refer to branch 3.4

I want to build this client for use in an EKS pod, following the AWS guide here: https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/ which doesn't mention anything about accessing Glue via a Hive Metastore. We are currently using Spark 3.0.1-SNAPSHOT and an old version of Hive, which doesn't give us access to the S3A improvements in the newer Hadoop 3 versions.

I followed the README.md on branch-3.4.0: I cloned Hive, applied the patch from this repo, and built Hive 3.1.3 per the instructions just fine using openjdk-8 and Maven 3.6.3.
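
For reference, the Hive side boiled down to roughly the following (the patch path below is a placeholder - use whichever patch file the branch-3.4.0 README points at and apply it the way the README describes):

git clone https://github.com/apache/hive.git
cd hive
git checkout rel/release-3.1.3
# placeholder path - substitute the actual Hive 3.x patch shipped in this repository
git apply /path/to/aws-glue-data-catalog-client-for-apache-hive-metastore/<hive-3.x-glue-client>.patch
mvn clean install -DskipTests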

I then moved into the directory of this repo with branch-3.4.0 checked out and ensured the hive3 version in the pom.xml was set to 3.1.3.

The instructions say to change into the aws-glue-datacatalog-hive2-client folder; however, that folder doesn't exist in this branch, so I went ahead and tried to build the client in aws-glue-datacatalog-hive3-client instead. I receive this error:

/aws-glue-datacatalog-hive3-client# mvn clean package -DskipTests
[INFO] Scanning for projects...
[INFO] 
[INFO] --------< com.amazonaws.glue:aws-glue-datacatalog-hive3-client >--------
[INFO] Building AWSGlueDataCatalogHive3Client 3.4.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[WARNING] The POM for com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:3.4.0-SNAPSHOT is missing, no dependency information available
[WARNING] The POM for com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:tests:3.4.0-SNAPSHOT is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  0.463 s
[INFO] Finished at: 2022-05-19T04:42:31Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project aws-glue-datacatalog-hive3-client: Could not resolve dependencies for project com.amazonaws.glue:aws-glue-datacatalog-hive3-client:jar:3.4.0-SNAPSHOT: The following artifacts could not be resolved: com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:3.4.0-SNAPSHOT, com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:tests:3.4.0-SNAPSHOT: Could not find artifact com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:3.4.0-SNAPSHOT -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
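
I assume the aws-glue-datacatalog-client-common 3.4.0-SNAPSHOT is meant to come from the sibling module in this repo rather than from Maven Central, so presumably it has to be installed into the local repository first - something like this from the repository root (the module name is taken from the directory layout), although as shown further down, other snapshot dependencies are missing too:

mvn clean install -DskipTests -pl aws-glue-datacatalog-hive3-client -am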

If I try to just build the Spark client and not the Hive client as well:

aws-glue-datacatalog-spark-client# mvn clean package -DskipTests
[INFO] Scanning for projects...
[INFO] 
[INFO] --------< com.amazonaws.glue:aws-glue-datacatalog-spark-client >--------
[INFO] Building AWSGlueDataCatalogSparkClient 3.4.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[WARNING] The POM for org.apache.hive:hive-metastore:jar:2.3.10-SNAPSHOT is missing, no dependency information available
[WARNING] The POM for org.apache.hive:hive-exec:jar:2.3.10-SNAPSHOT is missing, no dependency information available
[WARNING] The POM for com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:3.4.0-SNAPSHOT is missing, no dependency information available
[WARNING] The POM for com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:tests:3.4.0-SNAPSHOT is missing, no dependency information available
Downloading from central: https://repo.maven.apache.org/maven2/com/fasterxml/jackson/core/jackson-annotations/2.7.0/jackson-annotations-2.7.0.jar
Downloading from central: https://repo.maven.apache.org/maven2/org/slf4j/slf4j-api/1.7.25/slf4j-api-1.7.25.jar
Downloading from central: https://repo.maven.apache.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.7.8/jackson-databind-2.7.8.jar
Downloading from central: https://repo.maven.apache.org/maven2/com/fasterxml/jackson/core/jackson-core/2.7.8/jackson-core-2.7.8.jar
Downloaded from central: https://repo.maven.apache.org/maven2/org/slf4j/slf4j-api/1.7.25/slf4j-api-1.7.25.jar (41 kB at 38 kB/s)
Downloaded from central: https://repo.maven.apache.org/maven2/com/fasterxml/jackson/core/jackson-annotations/2.7.0/jackson-annotations-2.7.0.jar (51 kB at 46 kB/s)
Downloaded from central: https://repo.maven.apache.org/maven2/com/fasterxml/jackson/core/jackson-core/2.7.8/jackson-core-2.7.8.jar (253 kB at 175 kB/s)
Downloaded from central: https://repo.maven.apache.org/maven2/com/fasterxml/jackson/core/jackson-databind/2.7.8/jackson-databind-2.7.8.jar (1.2 MB at 533 kB/s)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  2.651 s
[INFO] Finished at: 2022-05-19T05:16:13Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project aws-glue-datacatalog-spark-client: Could not resolve dependencies for project com.amazonaws.glue:aws-glue-datacatalog-spark-client:jar:3.4.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.hive:hive-metastore:jar:2.3.10-SNAPSHOT, org.apache.hive:hive-exec:jar:2.3.10-SNAPSHOT, com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:3.4.0-SNAPSHOT, com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:tests:3.4.0-SNAPSHOT: Could not find artifact org.apache.hive:hive-metastore:jar:2.3.10-SNAPSHOT -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
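
I'm guessing the hive-metastore/hive-exec 2.3.10-SNAPSHOT artifacts are supposed to come from a locally built, patched Hive 2.3 as well, since they don't exist on Maven Central - something along these lines, with the patch path again being a placeholder - but the README for this branch doesn't spell that out:

# in a clean clone of apache/hive (the 2.3.10-SNAPSHOT version in the error matches Hive's branch-2.3 at the time)
cd hive
git checkout branch-2.3
# placeholder path - substitute the Hive 2.x patch shipped in this repository
git apply /path/to/aws-glue-data-catalog-client-for-apache-hive-metastore/<hive-2.x-glue-client>.patch
mvn clean install -DskipTests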

I also tried running mvn clean package -DskipTests from the project's top-level directory to build all the components; that failed with the errors below after building some of the components successfully:

aws-glue-data-catalog-client-for-apache-hive-metastore# mvn clean package -DskipTests
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO] 
[INFO] AWSGlueDataCatalogHiveClient                                       [pom]
[INFO] AwsGlueDataCatalogShims                                            [pom]
[INFO] ShimsCommon                                                        [jar]
[INFO] Hive3Shims                                                         [jar]
[INFO] spark-hive-shims                                                   [jar]
[INFO] ShimsLoader                                                        [jar]
[INFO] AWSGlueDataCatalogClientCommon                                     [jar]
[INFO] AWSGlueDataCatalogSparkClient                                      [jar]
[INFO] AWSGlueDataCatalogHive3Client                                      [jar]
[INFO] 
[INFO] --------< com.amazonaws.glue:aws-glue-datacatalog-hive-client >---------
[INFO] Building AWSGlueDataCatalogHiveClient 3.4.0-SNAPSHOT               [1/9]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ aws-glue-datacatalog-hive-client ---
[INFO] 
[INFO] ----------------------< com.amazonaws.glue:shims >----------------------
[INFO] Building AwsGlueDataCatalogShims 3.4.0-SNAPSHOT                    [2/9]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ shims ---
[INFO] 
[INFO] ------------------< com.amazonaws.glue:shims-common >-------------------
[INFO] Building ShimsCommon 3.4.0-SNAPSHOT                                [3/9]
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ shims-common ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ shims-common ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /build/src/test/aws-glue-data-catalog-client-for-apache-hive-metastore/shims/common/src/main/resources
[INFO] 
[INFO] --- maven-compiler-plugin:2.3.1:compile (default-compile) @ shims-common ---
[INFO] Compiling 1 source file to /build/src/test/aws-glue-data-catalog-client-for-apache-hive-metastore/shims/common/target/classes
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ shims-common ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /build/src/test/aws-glue-data-catalog-client-for-apache-hive-metastore/shims/common/src/test/resources
[INFO] 
[INFO] --- maven-compiler-plugin:2.3.1:testCompile (default-testCompile) @ shims-common ---
[INFO] No sources to compile
[INFO] 
[INFO] --- maven-surefire-plugin:2.5:test (default-test) @ shims-common ---
[INFO] Tests are skipped.
[INFO] 
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ shims-common ---
[INFO] Building jar: /build/src/test/aws-glue-data-catalog-client-for-apache-hive-metastore/shims/common/target/shims-common-3.4.0-SNAPSHOT.jar
[INFO] 
[INFO] -------------------< com.amazonaws.glue:hive3-shims >-------------------
[INFO] Building Hive3Shims 3.4.0-SNAPSHOT                                 [4/9]
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ hive3-shims ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ hive3-shims ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /build/src/test/aws-glue-data-catalog-client-for-apache-hive-metastore/shims/hive3-shims/src/main/resources
[INFO] 
[INFO] --- maven-compiler-plugin:2.3.1:compile (default-compile) @ hive3-shims ---
[INFO] Compiling 1 source file to /build/src/test/aws-glue-data-catalog-client-for-apache-hive-metastore/shims/hive3-shims/target/classes
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ hive3-shims ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /build/src/test/aws-glue-data-catalog-client-for-apache-hive-metastore/shims/hive3-shims/src/test/resources
[INFO] 
[INFO] --- maven-compiler-plugin:2.3.1:testCompile (default-testCompile) @ hive3-shims ---
[INFO] No sources to compile
[INFO] 
[INFO] --- maven-surefire-plugin:2.5:test (default-test) @ hive3-shims ---
[INFO] Tests are skipped.
[INFO] 
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ hive3-shims ---
[INFO] Building jar: /build/src/test/aws-glue-data-catalog-client-for-apache-hive-metastore/shims/hive3-shims/target/hive3-shims-3.4.0-SNAPSHOT.jar
[INFO] 
[INFO] ----------------< com.amazonaws.glue:spark-hive-shims >-----------------
[INFO] Building spark-hive-shims 3.4.0-SNAPSHOT                           [5/9]
[INFO] --------------------------------[ jar ]---------------------------------
[WARNING] The POM for org.apache.hive:hive-exec:jar:2.3.10-SNAPSHOT is missing, no dependency information available
[WARNING] The POM for org.apache.hive:hive-metastore:jar:2.3.10-SNAPSHOT is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for AWSGlueDataCatalogHiveClient 3.4.0-SNAPSHOT:
[INFO] 
[INFO] AWSGlueDataCatalogHiveClient ....................... SUCCESS [  0.062 s]
[INFO] AwsGlueDataCatalogShims ............................ SUCCESS [  0.001 s]
[INFO] ShimsCommon ........................................ SUCCESS [  1.007 s]
[INFO] Hive3Shims ......................................... SUCCESS [  0.380 s]
[INFO] spark-hive-shims ................................... FAILURE [  0.005 s]
[INFO] ShimsLoader ........................................ SKIPPED
[INFO] AWSGlueDataCatalogClientCommon ..................... SKIPPED
[INFO] AWSGlueDataCatalogSparkClient ...................... SKIPPED
[INFO] AWSGlueDataCatalogHive3Client ...................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.517 s
[INFO] Finished at: 2022-05-19T04:48:55Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project spark-hive-shims: Could not resolve dependencies for project com.amazonaws.glue:spark-hive-shims:jar:3.4.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.hive:hive-exec:jar:2.3.10-SNAPSHOT, org.apache.hive:hive-metastore:jar:2.3.10-SNAPSHOT: Could not find artifact org.apache.hive:hive-exec:jar:2.3.10-SNAPSHOT -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :spark-hive-shims

In any event, I'm just not able to get this branch to build at all. I feel like there is some basic configuration missing to get this working.

berglh commented 2 years ago

I am pretty sure my problem comes from trying to use a Hive 3.x patch against a Spark 3.1.x build, which bundles Hive 2.3.x.

yongkyunlee commented 7 months ago

@berglh How were you able to resolve the issue?

berglh commented 7 months ago

@yongkyunlee I ended up pulling the EMR Docker image from the AWS ECR registry. The main reason is that the EMR build is designed to run Apache Spark and interact with AWS services, so its jars work with the matching Spark version. I then used the AWS-built Apache Spark by copying it from the EMR image into a new container that we launch in Amazon EKS via Airflow/Kubeflow notebook servers.

Just note that we move everything back to the stock Apache Spark default locations rather than the locations AWS uses in the EMR container. We also spent quite a lot of time updating our Apache Spark configuration to merge the default configuration from the AWS container with that of our earlier Apache Spark builds, which used a manually built jar. Most of this is explained in the README of this repo, but there are some other critical configurations to observe in the container.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6150-release.html

To grab the container, log in to AWS ECR with your AWS account's STS credentials:

aws ecr get-login-password --region ap-southeast-2 | sudo docker login --username AWS --password-stdin 038297999601.dkr.ecr.ap-southeast-2.amazonaws.com
sudo docker pull 038297999601.dkr.ecr.ap-southeast-2.amazonaws.com/spark/emr-${EMR_VERSION}:latest
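
In case a concrete example helps, the quickest way to lift the Spark distribution (Glue client jars included) out of the image for inspection is something like this - the /usr/lib/spark and /usr/lib/hadoop paths are from memory, so double-check the layout in the version you pull:

cid=$(sudo docker create 038297999601.dkr.ecr.ap-southeast-2.amazonaws.com/spark/emr-6.12.0:latest)
sudo docker cp "${cid}:/usr/lib/spark" ./emr-spark
sudo docker cp "${cid}:/usr/lib/hadoop" ./emr-hadoop
sudo docker rm "${cid}"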

We are currently running:


######
## EMR & Spark Related Versions
# The following versions need to match those used in the appropriate EMR version https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-app-versions-6.x.html
EMR_VERSION=6.12.0
SPARK_VERSION=3.4.0
HADOOP_VERSION=3.3.3
HIVE_VERSION=3.1.3
SCALA_VERSION=2.12
AWS_JAVA_SDK_VERSION=1.12.490

#######
## Build: Versions used specifically in the jar building container (build) not in the final container
MAVEN_VERSION=3.8.8
PYTHON_VERSION=3.9

I'll attach part of the Dockerfile we used to build it; please note I've excluded the build steps for some custom plugins. We run the build with Docker Compose, which passes the version of each package in as a build argument (see the sketch below); the main thing is to match all the versions with what's in the EMR container. I tried building Spark 3.4.1 with this approach but it isn't working and I haven't gotten around to sorting it out yet - hopefully you're familiar enough with Spark in general and Docker to figure it out :) You should also see where I copy all the jars from in the container, which should let you retrieve the built artefacts and the configuration they use.
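
A plain docker build equivalent of what the Compose file does would be roughly the following (the ARG names simply mirror the env file above, and the image tag is made up):

sudo docker build \
  --build-arg EMR_VERSION=6.12.0 \
  --build-arg SPARK_VERSION=3.4.0 \
  --build-arg HADOOP_VERSION=3.3.3 \
  --build-arg HIVE_VERSION=3.1.3 \
  --build-arg SCALA_VERSION=2.12 \
  --build-arg AWS_JAVA_SDK_VERSION=1.12.490 \
  -t custom-spark:3.4.0-emr-6.12.0 .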

Dockerfile.txt

berglh commented 7 months ago

@yongkyunlee Just a heads up if you go down this path: Spark 3.4 has moved to log4j2 logging. If you want the Pyspark INFO-level logs, the configuration in the log4j2 properties is:

# Set the Pyspark default logging level, if it's not the same level as the rootLogger
# a warning message is printed on Pyspark REPL shell startup - keep this the same
logger.python.name = org.apache.spark.api.python.PythonGatewayServer
logger.python.level = info
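
Those two lines sit in Spark's conf/log4j2.properties alongside the usual console appender and root logger; a minimal sketch of the surrounding file, based on the template Spark ships (so a starting point rather than our exact config):

rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n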

This took a lot of digging through the source code and manually building Apache Spark to figure out, as it wasn't included or documented. Maybe this has changed in newer versions.

yongkyunlee commented 7 months ago

@berglh Thanks a lot for the detailed explanations! Really appreciate it. I'll let you know if I have any follow-up questions.

yongkyunlee commented 7 months ago

Just to share some updates: I was able to get it working by first building Hive 2.3 and then Hive 3.1 per the README instructions (I had to exclude two dependencies to make the Hive 3.1 build work). When I then built the Glue Data Catalog Spark/Hive client, the shim dependencies were picked up. When actually running Spark (thanks for the hints in the Dockerfile you attached), I had to add a lot of Hive 2.3 jars as dependencies and use the spark.sql.hive.metastore.jars config to tell Spark to use the patched Hive 2.3 for the metastore.
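
For anyone following along, the relevant Spark settings looked roughly like this (the metastore version, jar path, and app name are illustrative rather than my exact values; the factory class is the Glue client factory referenced in the README):

spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.hive.metastore.version=2.3.9 \
  --conf spark.sql.hive.metastore.jars=path \
  --conf spark.sql.hive.metastore.jars.path=file:///opt/hive-2.3-patched/lib/*.jar \
  --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  my_job.py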

berglh commented 7 months ago

@yongkyunlee I also got this building for Spark 3.3, but I've since lost my Dockerfile for it. While I was able to list the metastore catalog, I was definitely seeing issues with full functionality when loading tables - I think your trick of including the Hive 2.3 jars was probably the missing piece.

I then asked myself why bother building it when it's already bundled in the EMR container; basing the container build off a working solution was much faster, and I really wanted to be sure I was getting all the advantages of the latest Hadoop version for the object store committer optimisations. It was the easier solution in the end, but I'm glad you got it working - just make sure all the functionality you require works as expected. Kudos! :+1: