awslabs / aws-glue-data-catalog-client-for-apache-hive-metastore

The AWS Glue Data Catalog is a fully managed, Apache Hive Metastore compatible, metadata repository. Customers can use the Data Catalog as a central repository to store structural and operational metadata for their data. AWS Glue provides out-of-box integration with Amazon EMR that enables customers to use the AWS Glue Data Catalog as an external Hive Metastore. This is an open-source implementation of the Apache Hive Metastore client on Amazon EMR clusters that uses the AWS Glue Data Catalog as an external Hive Metastore. It serves as a reference implementation for building a Hive Metastore-compatible client that connects to the AWS Glue Data Catalog. It may be ported to other Hive Metastore-compatible platforms such as other Hadoop and Apache Spark distributions.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
Apache License 2.0

Pig HcatStorer fails with AWS Glue Data Catalog as metastore for Hive. #37

Open dgghosalaws opened 3 years ago

dgghosalaws commented 3 years ago

Use case: Running the example here -> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hcatalog-pig.html

Outcome: The Pig script fails when Glue is the Hive metastore. The script reports a failed status, although the files are written to S3.

Error logs:

OperationException: getTokenStrForm is not supported
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.commitJob(PigOutputCommitter.java:257)
    at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigOutputFormatTez$PigOutputCommitterTez.commitJob(PigOutputFormatTez.java:98)
    at org.apache.tez.mapreduce.committer.MROutputCommitter.commitOutput(MROutputCommitter.java:99)
    at org.apache.tez.dag.app.dag.impl.DAGImpl$1.run(DAGImpl.java:1032)
    at org.apache.tez.dag.app.dag.impl.DAGImpl$1.run(DAGImpl.java:1029)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
    at org.apache.tez.dag.app.dag.impl.DAGImpl.commitOutput(DAGImpl.java:1029)
    at org.apache.tez.dag.app.dag.impl.DAGImpl.access$2000(DAGImpl.java:149)
    at org.apache.tez.dag.app.dag.impl.DAGImpl$3.call(DAGImpl.java:1108)
    at org.apache.tez.dag.app.dag.impl.DAGImpl$3.call(DAGImpl.java:1103)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: getTokenStrForm is not supported
    at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.getTokenStrForm(GlueMetastoreClientDelegate.java:1583)
    at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.getTokenStrForm(AWSCatalogMetastoreClient.java:516)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hive.hcatalog.common.HiveClientCache$CacheableHiveMetaStoreClient.invoke(HiveClientCache.java:590)
    at com.sun.proxy.$Proxy67.getTokenStrForm(Unknown Source)
    at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.cancelDelegationTokens(FileOutputCommitterContainer.java:1012)
    at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.commitJob(FileOutputCommitterContainer.java:274)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.commitJob(PigOutputCommitter.java:255)

itharavi commented 3 years ago

+1

Oleks777 commented 2 years ago

+1

I get another error: Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found)

moneroexamples commented 2 years ago

@Oleks777

I had the same issue on emr-5.36.0 (I did not test other versions) when trying to use Pig with HCatalog so that I could load tables from Glue into Pig:

pig -useHCatalog

In my case the solution was to manually specify the missing jar:

pig -useHCatalog -Dpig.additional.jars=/usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive2-client-1.18.0.jar

On other EMR versions, the jar may have a different version number than aws-glue-datacatalog-hive2-client-1.18.0.jar, so check the contents of /usr/share/aws/hmclient/lib/.

Then, to load data from a Glue table:

data = LOAD 'somedatabase.sometablename' USING org.apache.hive.hcatalog.pig.HCatLoader();

then check:

describe data;

Oleks777 commented 2 years ago

Thanks @moneroexamples! Yes, it did the trick. Instead of adding the jar as you describe, you can also use the REGISTER command in the script.

It looks like this solution works only for EMR 5.x releases (hive2); it doesn't work for 6.x. Does anyone have any advice?

moneroexamples commented 2 years ago

@Oleks777 I just checked on emr-6.6 and the following works:

pig -useHCatalog -Dpig.additional.jars=/usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive3-client-3.5.0.jar

As a side note, on EMR 6.6 hcat by itself also does not work with Glue:

hcat -e "show databases;"

giving error:

Caused by: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found)

You can solve this by setting HIVE_AUX_JARS_PATH before you call hcat:

export HIVE_AUX_JARS_PATH=/usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive3-client-3.5.0.jar
hcat -e "show databases;"

Oleks777 commented 2 years ago

@moneroexamples Many thanks! I spent a lot of time compiling the client for hive2, and it is good to know there is a compiled version available from AWS. Is the path /usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive3-client-3.5.0.jar available on the data nodes by default, or does EMR need to be configured somehow in a bootstrap step?

dgghosalaws commented 2 years ago

I ask everyone to either support the premise of the issue title or confirm whether HCatStorer partition writes work with the Glue Data Catalog as the Hive metastore. I fully appreciate the iterations above to make basic commands work with Pig on EMR. Thanks.

moneroexamples commented 2 years ago

@Oleks777 Sadly, I don't know how to configure EMR so that the extra paths/jars are loaded for Pig and hcat at the bootstrap step.

eagleshine commented 2 years ago

Any update on this issue? We also encountered the same getTokenStrForm is not supported error when using HCatStorer(...) in EMR.

zsaltys commented 1 year ago

I'm getting the same error when storing data to ORC or Parquet tables with the latest version of EMR, 6.12.0. It seems that support for writing to Glue tables is broken.

zsaltys commented 1 year ago

After a little bit of digging we can see the problem originates here:

at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.cancelDelegationTokens(FileOutputCommitterContainer.java:1012)
at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.commitJob(FileOutputCommitterContainer.java:274)

If we look at the file:

https://github.com/apache/hive/blob/920f9e535db6270a401db274eef3267d70c1fd2f/hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/FileOutputCommitterContainer.java#L258

We can see that cancelling the delegation tokens (cancelDelegationTokens) is the last thing that happens. We can also see how it's used:

https://github.com/apache/hive/blob/920f9e535db6270a401db274eef3267d70c1fd2f/hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/FileOutputCommitterContainer.java#L997

All we really need to do is return null from getTokenStrForm instead of throwing UnsupportedOperationException; the delegation token cancel method should then work fine.
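
To make that reasoning concrete, here is a minimal, self-contained Java sketch. It does not use the real Hive or Glue classes: `TokenOps`, `StubClient`, and the simplified `cancelDelegationTokens` below are hypothetical stand-ins. It also assumes that the HCatalog caller skips cancellation when the token string is null (the real method may apply further conditions). It only illustrates why an unchecked UnsupportedOperationException aborts the commit while a null return would not.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenCancelSketch {

    /** Hypothetical stand-in for the two metastore client methods involved. */
    interface TokenOps {
        String getTokenStrForm();
        void cancelDelegationToken(String tokenStrForm);
    }

    /** Stub client that either throws (as in the stack trace) or returns null (the proposal). */
    static class StubClient implements TokenOps {
        final boolean throwOnGetToken;
        final List<String> cancelled = new ArrayList<>();

        StubClient(boolean throwOnGetToken) {
            this.throwOnGetToken = throwOnGetToken;
        }

        @Override
        public String getTokenStrForm() {
            if (throwOnGetToken) {
                // Same message as the reported error.
                throw new UnsupportedOperationException("getTokenStrForm is not supported");
            }
            // Proposed behaviour: report that there is no delegation token.
            return null;
        }

        @Override
        public void cancelDelegationToken(String tokenStrForm) {
            cancelled.add(tokenStrForm);
        }
    }

    /**
     * Simplified caller. Assumption: cancellation is skipped when the token
     * string is null. Returns true only if a token was actually cancelled.
     */
    static boolean cancelDelegationTokens(TokenOps client) {
        String tokenStrForm = client.getTokenStrForm();
        if (tokenStrForm != null) {
            client.cancelDelegationToken(tokenStrForm);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        try {
            cancelDelegationTokens(new StubClient(true));
        } catch (UnsupportedOperationException e) {
            // The runtime exception escapes the cancel step, so the commit is reported as failed.
            System.out.println("commit fails: " + e.getMessage());
        }
        boolean cancelled = cancelDelegationTokens(new StubClient(false));
        System.out.println("commit proceeds, token cancelled: " + cancelled);
    }
}
```

With the throwing stub the exception propagates out of the cancel step, which matches the failure in the trace; with the null-returning stub the cancel step becomes a no-op. Whether returning null is acceptable for the real client is a decision for the maintainers.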