awslabs / aws-glue-data-catalog-client-for-apache-hive-metastore

The AWS Glue Data Catalog is a fully managed, Apache Hive Metastore compatible, metadata repository. Customers can use the Data Catalog as a central repository to store structural and operational metadata for their data. AWS Glue provides out-of-box integration with Amazon EMR that enables customers to use the AWS Glue Data Catalog as an external Hive Metastore. This is an open-source implementation of the Apache Hive Metastore client on Amazon EMR clusters that uses the AWS Glue Data Catalog as an external Hive Metastore. It serves as a reference implementation for building a Hive Metastore-compatible client that connects to the AWS Glue Data Catalog. It may be ported to other Hive Metastore-compatible platforms such as other Hadoop and Apache Spark distributions.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
Apache License 2.0

Building aws-glue-datacatalog-hive2-client fails #21

Open axelborja opened 4 years ago

axelborja commented 4 years ago

Here is the Dockerfile I use to build the patched Hive and Spark clients:

FROM python:3.6-buster

# SET WORKDIR
WORKDIR /src

# INSTALL JAVA
RUN echo "deb http://ftp.us.debian.org/debian sid main" >> /etc/apt/sources.list && \
    apt-get update && \
    apt-get install -y openjdk-8-jdk && \
    rm -rf /var/cache/apt/*

# INSTALL MAVEN AS EXPECTED BY GLUE
RUN apt-get install -y wget
RUN wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
RUN tar zxvf apache-maven-3.6.0-bin.tar.gz
ENV PATH=/src/apache-maven-3.6.0/bin:$PATH
RUN rm apache-maven-3.6.0-bin.tar.gz

# BUILD PATCHED HIVE FOR HIVE CLIENT
WORKDIR /src
RUN git clone https://github.com/apache/hive.git
WORKDIR /src/hive
RUN wget https://issues.apache.org/jira/secure/attachment/12958418/HIVE-12679.branch-2.3.patch
RUN git checkout branch-2.3
RUN patch -p0 <HIVE-12679.branch-2.3.patch
RUN mvn clean install -DskipTests

# BUILD PATCHED HIVE CLIENT
WORKDIR /src
RUN git clone https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore.git
WORKDIR /src/aws-glue-data-catalog-client-for-apache-hive-metastore
RUN sed -i  's/2.3.3/2.3.7-SNAPSHOT/g' pom.xml
WORKDIR /src/aws-glue-data-catalog-client-for-apache-hive-metastore/aws-glue-datacatalog-hive2-client
# IT FAILS HERE:
RUN mvn clean package -DskipTests

# BUILD PATCHED SPARK CLIENT
# ...

The encountered error is:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  18.216 s
[INFO] Finished at: 2020-03-18T21:04:39Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project aws-glue-datacatalog-hive2-client: Could not resolve dependencies for project com.amazonaws.glue:aws-glue-datacatalog-hive2-client:jar:1.10.0-SNAPSHOT: The following artifacts could not be resolved: com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:1.10.0-SNAPSHOT, com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:tests:1.10.0-SNAPSHOT: Could not find artifact com.amazonaws.glue:aws-glue-datacatalog-client-common:jar:1.10.0-SNAPSHOT -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
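The error says Maven could not find the sibling module aws-glue-datacatalog-client-common at version 1.10.0-SNAPSHOT. When a submodule is built on its own, Maven looks for such siblings in the local repository, so one quick diagnostic is to check whether the jar has been installed there. The following is a minimal sketch: the function name `artifact_installed` is hypothetical, and it assumes the default local repository location (`~/.m2/repository`) unless another path is passed.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a locally built artifact is present in the
# local Maven repository. Arguments: groupId artifactId version [repo_dir].
artifact_installed() {
    group_id=$1
    artifact_id=$2
    version=$3
    # Assumption: default local repository unless a fourth argument is given.
    repo=${4:-"$HOME/.m2/repository"}
    # The local repository stores artifacts under
    # <groupId with dots replaced by slashes>/<artifactId>/<version>/.
    group_path=$(printf '%s' "$group_id" | tr '.' '/')
    [ -f "$repo/$group_path/$artifact_id/$version/$artifact_id-$version.jar" ]
}
```

For the failure above, the call would be `artifact_installed com.amazonaws.glue aws-glue-datacatalog-client-common 1.10.0-SNAPSHOT`. A non-zero status suggests the common module has not yet been built and installed on that machine.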

hortonworks-sk commented 4 years ago

Getting same issue.

axelborja commented 4 years ago

Unfortunately, issues do not seem to have been looked at since June 2019 😞

edugfilho commented 4 years ago

I found the root cause of this problem: aws-glue-datacatalog-hive2-client depends on aws-glue-datacatalog-client-common, which depends on shims/shims-loader, which depends on shims/spark-hive-shims, which in turn needs hive-exec at the configured version (1.2.1). I looked it up on mvn-repository and found that version 1.2.1 doesn't exist.

So I decided to use the available 1.2.1.spark2 and updated the following line in aws-glue-data-catalog-client-for-apache-hive-metastore/pom.xml:

<spark-hive.version>1.2.1.spark2</spark-hive.version>

Then it should work out of the box. In case it doesn't, do the following before you proceed:

EDIT: If you're also going to build the Spark Client, just remember to follow the README and change spark-hive.version back to 1.2.3-SNAPSHOT before doing it.
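The spark-hive.version change, and the switch back to 1.2.3-SNAPSHOT for the Spark client, can be scripted rather than edited by hand. This is a minimal sketch, not part of the original procedure: the function name `set_spark_hive_version` is hypothetical, and it assumes the property sits on a single line of the pom.xml.

```shell
#!/bin/sh
# Hypothetical helper: set the <spark-hive.version> property in a pom.xml.
# Arguments: path to pom.xml, new version string.
# Assumption: the opening and closing tags are on the same line.
set_spark_hive_version() {
    pom=$1
    version=$2
    # -i.bak edits in place and keeps a backup copy next to the file;
    # this spelling is accepted by both GNU and BSD sed.
    sed -i.bak \
        "s|<spark-hive.version>[^<]*</spark-hive.version>|<spark-hive.version>${version}</spark-hive.version>|" \
        "$pom"
}
```

From the repository root this would be called as `set_spark_hive_version pom.xml 1.2.1.spark2` before building the Hive client, and as `set_spark_hive_version pom.xml 1.2.3-SNAPSHOT` before building the Spark client.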

Let me know if that works for you!

aws-austin-lee commented 4 years ago

Thanks for figuring that out and sharing it here!

axelborja commented 4 years ago

Thank you @edugfilho. It seems to work fine.

vivek-menon commented 4 years ago

I tried the approach mentioned above. Building the modules individually doesn't seem to work, but when I build the package from the base location, the build gets as far as shown below and then fails with the following error:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] AWSGlueDataCatalogHiveClient ....................... SUCCESS [ 0.186 s]
[INFO] AwsGlueDataCatalogShims ............................ SUCCESS [ 0.002 s]
[INFO] ShimsCommon ........................................ SUCCESS [ 3.153 s]
[INFO] Hive2Shims ......................................... SUCCESS [ 0.841 s]
[INFO] spark-hive-shims ................................... SUCCESS [ 0.871 s]
[INFO] ShimsLoader ........................................ SUCCESS [ 0.883 s]
[INFO] AWSGlueDataCatalogClientCommon ..................... SUCCESS [ 3.331 s]
[INFO] AWSGlueDataCatalogHive2Client ...................... FAILURE [ 2.338 s]
[INFO] AWSGlueDataCatalogSparkClient ...................... SKIPPED
[INFO] ------------------------------------------------------------------------

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.1:compile (default-compile) on project aws-glue-datacatalog-hive2-client: Compilation failure
[ERROR] /usr/local/aws-glue-data-catalog-client-for-apache-hive-metastore/aws-glue-datacatalog-hive2-client/src/main/java/com/amazonaws/glue/catalog/metastore/AWSCatalogMetastoreClient.java:[117,7] error: AWSCatalogMetastoreClient is not abstract and does not override abstract method listPartitionValues(PartitionValuesRequest) in IMetaStoreClient

Any idea?

ncolomer commented 4 years ago

Together with @axelborja, we open-sourced the procedure in the tinyclues/spark-glue-data-catalog repo and published the distributable as a GitHub Release (available for download).

kidotaka commented 3 years ago

https://issues.apache.org/jira/plugins/servlet/mobile#issue/HIVE-21859

https://github.com/apache/hive/commit/9fb2238fac7707b2fbb3a33066d1f9cc077904f3

During Hive 2.3.6 development, the IMetaStoreClient interface gained a new method, "listPartitionValues", on 18 Jun 2019. AWSCatalogMetastoreClient in this repository does not implement that method yet.

The simplest workaround is to patch against Hive 2.3.5.
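Before starting a long build, it can help to check whether a given Hive checkout already contains the interface change. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and it only searches for the method name in any file called IMetaStoreClient.java under the checkout, so it is a heuristic rather than a real compatibility check.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any IMetaStoreClient.java under the given
# Hive source directory mentions listPartitionValues.
hive_declares_list_partition_values() {
    hive_dir=$1
    # find locates the interface source wherever it lives in the tree;
    # grep -l prints the file name when the method name occurs in it;
    # the final grep -q turns "any output at all" into the exit status.
    find "$hive_dir" -name 'IMetaStoreClient.java' \
        -exec grep -l 'listPartitionValues' {} + 2>/dev/null | grep -q .
}
```

If `hive_declares_list_partition_values /src/hive` succeeds, the checkout includes the new method and the client in this repository is likely to hit the compilation failure reported above. In that case, an earlier release such as 2.3.5 is the one to build against.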

jpugliesi commented 3 years ago

For what it's worth, we've open-sourced a fully contained build and Docker image for Spark 3.1.1 (with the Kubernetes deps), Hadoop 3.2.0, Hive 2.3.7, and this Glue client: https://github.com/viaduct-ai/docker-spark-k8s-aws