aws-samples / emr-serverless-samples

Example code for running Spark and Hive jobs on EMR Serverless.
https://aws.amazon.com/emr/serverless/
MIT No Attribution

Hive Example Fails when using JSON Data #33

Closed jrolstad closed 2 years ago

jrolstad commented 2 years ago

I am using the Hive example as a template, but with JSON data instead. With the setup below, I receive an error every time. Is there a different setup I should be using? Also, is there a good example for Hive / EMR Serverless using JSON data that should work?

Having an additional Hive example based on JSON in this repository would be helpful, since the existing one uses CSV-formatted data.

Details

Given an S3 bucket that contains files with a format such as

{"Id":"123","Name":"my-name","Type":"some-type"}

and the initialization script here and the query script here

When I run a job in EMR Serverless using these inputs, then I receive an error message stating:

Job failed, please check complete logs in configured logging destination. ExitCode: 2. Last few exceptions:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found
Caused by: java.lang.RuntimeException: Map operator initialization failed
], TaskAttempt 2 failed, info=[Error: Error while running task ( failure ) : attempt_1665435121822_0001_1_00_000000_2:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed
Caused by: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found
Caused by: java.lang.RuntimeException: Map operator initialization failed...

dacort commented 2 years ago

Hi @jrolstad! I'm guessing based on the error message we don't have org.apache.hadoop.hive.serde2.JsonSerDe included by default with EMR Serverless. I think there are a couple of options here:

I've poked around a little bit and it's not immediately clear to me which one to use. We might have org.openx.data.jsonserde.JsonSerDe so go ahead and give that one a try too.

jrolstad commented 2 years ago

Thanks, that's what I was assuming. Can you tell me where I can find the available libraries for EMR Serverless so I can self-serve next time?

dacort commented 2 years ago

I was trying to find this info as well. :) I'd take a look at the default SerDes listed for Hive ( https://cwiki.apache.org/confluence/display/Hive/SerDe ) since that's what will be included with each EMR release.

Also, if you have access to an EMR cluster or the EMR on EKS container images, you can poke around for Hive jars for your specific version.

# Run bash in the EMR on EKS container image
docker run --rm -it 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0 /bin/bash
# Find hive jars 
find / -iname '*hive*.jar'

# Look for JSON serdes
jar tvf /usr/lib/spark/jars/hive-serde-2.3.9-amzn-1.jar | grep -i json
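
If the `jar` tool is not available, the same search can be sketched in Python using only the standard library. This is a rough equivalent of the `find` and `jar tvf ... | grep -i json` steps above, not an official EMR utility. The function names are made up for illustration, and matching on "jsonserde" in the class path is an assumption about how the classes are named.

```python
import zipfile
from pathlib import Path


def find_json_serde_classes(jar_path):
    """Return top-level class names in a jar whose path contains 'jsonserde'."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name[: -len(".class")].replace("/", ".")
            for name in jar.namelist()
            if name.endswith(".class")
            and "$" not in name  # skip inner/anonymous classes
            and "jsonserde" in name.lower()
        )


def scan_jars(root):
    """Map each *hive*.jar under root to the JSON SerDe classes it contains."""
    found = {}
    for jar_path in Path(root).rglob("*.jar"):
        if "hive" not in jar_path.name.lower():
            continue
        try:
            classes = find_json_serde_classes(jar_path)
        except (zipfile.BadZipFile, OSError):
            continue  # unreadable or corrupt jar; ignore it
        if classes:
            found[str(jar_path)] = classes
    return found
```

Inside the container, something like `scan_jars("/usr/lib")` would be a reasonable starting point, given the jar path shown above; the exact directory may differ between releases.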

dacort commented 2 years ago

(correcting my previous comment after looking deeper at the documentation myself)

I think org.apache.hive.hcatalog.data.JsonSerDe moved to org.apache.hadoop.hive.serde2.JsonSerDe in Hive 3 (EMR 6.x), so give that a shot. I flipped the class names in my original comment.
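
To make the class swap concrete, here is a minimal sketch that renders a table definition for the sample record from the original post (`Id`, `Name`, `Type`). It is not the script linked in the issue: the table name `users_json` and the S3 path are hypothetical, and typing every column as `STRING` is an assumption.

```python
def render_create_table(
    table,
    location,
    serde="org.apache.hadoop.hive.serde2.JsonSerDe",
    columns=("id", "name", "type"),
):
    """Render a HiveQL CREATE EXTERNAL TABLE statement for JSON data.

    All columns are typed STRING here, which is an assumption that fits
    the sample record in the issue.
    """
    cols = ",\n".join(f"  `{col}` STRING" for col in columns)
    return (
        f"CREATE EXTERNAL TABLE `{table}` (\n{cols}\n)\n"
        f"ROW FORMAT SERDE '{serde}'\n"
        f"LOCATION '{location}';\n"
    )


# Hypothetical table name and bucket, for illustration only.
print(render_create_table("users_json", "s3://example-bucket/users/"))
```

Passing a different `serde` value, such as the org.openx class mentioned earlier, makes it easy to try the alternatives one at a time.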

I'll try this on my end as well.

jrolstad commented 2 years ago

@dacort Thanks for the update on the naming. I tried the org.apache.hadoop.hive.serde2.JsonSerDe value and still received the same result (java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found), so I think there may be a version mismatch in one of the EMR Serverless libraries being used to run these jobs.

I'm not using a cluster or EKS (trying to go all serverless), so I'm unable to verify versions. Waiting to hear what you find on your end as well.

jrolstad commented 2 years ago

@dacort Let me know if you are able to verify on your side as well. If so, let me know where to log the issue for this, as using EMR Serverless with JSON data in an S3 bucket seems like a standard use case that should be addressed.

dacort commented 2 years ago

@jrolstad Just gave it a shot using your scripts linked above and it worked fine for me.

One thing I noticed is that you're using org.apache.hadoop.hive.serde2.JsonSerDe in the user_createtables.sql script, but your error message says that org.apache.hive.hcatalog.data.JsonSerDe is the class that's not found. If you ran the createtables script previously with the latter serde, you'll need to drop that table before running the script again. Hive on EMR Serverless uses the Glue Data Catalog, so you can either delete it in the Glue Console or add a DROP TABLE statement. This confused me as well, so I should make it more explicit in the README here.
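
One way to spot this kind of stale definition is to compare the SerDe recorded in the Glue Data Catalog with the one in the script. The sketch below is a hedged illustration, not something from the repository. It assumes the response shape of the Glue `GetTable` API (`Table.StorageDescriptor.SerdeInfo.SerializationLibrary`), and the helper names are hypothetical. `fetch_table` needs the third-party `boto3` package and AWS credentials; the other helpers are plain Python.

```python
def catalog_serde(get_table_response):
    """Pull the SerDe class name out of a Glue GetTable response dict."""
    table = get_table_response.get("Table", {})
    serde_info = table.get("StorageDescriptor", {}).get("SerdeInfo", {})
    return serde_info.get("SerializationLibrary")


def is_stale(get_table_response, wanted_serde):
    """True when the catalog's SerDe differs from the one the script uses."""
    return catalog_serde(get_table_response) != wanted_serde


def drop_statement(table):
    """HiveQL to put at the top of the create-tables script."""
    return f"DROP TABLE IF EXISTS `{table}`;"


def fetch_table(database, table):
    """Live lookup; requires boto3 and AWS credentials (not exercised here)."""
    import boto3  # third-party, imported lazily so the helpers above work without it

    return boto3.client("glue").get_table(DatabaseName=database, Name=table)
```

For example, `is_stale(fetch_table("default", "users_json"), "org.apache.hadoop.hive.serde2.JsonSerDe")` would flag a table that was created earlier with the hcatalog class; the database and table names there are placeholders.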

As an aside, you can run the EMR on EKS container image locally without having to use EKS. It's handy for when you want to have a local EMR environment, but is primarily geared towards Spark.

jrolstad commented 2 years ago

Dropping the table and recreating worked! Thanks for the help.

dacort commented 2 years ago

Sweet, thanks for following up! I'll add a note to the Hive section re: that specific SerDe.