Closed: jrolstad closed this issue 2 years ago
Hi @jrolstad! I'm guessing based on the error message that we don't have org.apache.hadoop.hive.serde2.JsonSerDe included by default with EMR Serverless. I think there are a couple of options here:
org.apache.hive.hcatalog.data.JsonSerDe
?--packages
I've poked around a little bit and it's not immediately clear to me which one to use. We might have org.openx.data.jsonserde.JsonSerDe, so go ahead and give that one a try too.
Thanks, that's what I was assuming. Can you tell me where I can find the available libraries for EMR Serverless so I can self-serve next time?
I was trying to find this info as well. :) I'd take a look at the default SerDes listed for Hive ( https://cwiki.apache.org/confluence/display/Hive/SerDe ) since that's what will be included with each EMR release.
Also, if you have access to an EMR cluster or the EMR on EKS container images, you can poke around for Hive jars for your specific version.
# Run bash in the EMR on EKS container image
docker run --rm -it 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0 /bin/bash
# Find hive jars
find / -iname '*hive*.jar'
# Look for JSON serdes
jar tvf /usr/lib/spark/jars/hive-serde-2.3.9-amzn-1.jar | grep -i json
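To check every Hive jar instead of one, the two commands above can be combined into a loop. This is a rough sketch, not something taken from the EMR documentation: the function name find_json_serdes is made up for illustration, and it assumes the jar tool is on the PATH, as it is in the commands above.

```shell
# Hypothetical helper: list JSON SerDe classes found in every Hive jar
# under a root directory (defaults to /). Assumes `jar` is on the PATH.
find_json_serdes() {
  root="${1:-/}"
  find "$root" -iname '*hive*.jar' 2>/dev/null | while IFS= read -r jarfile; do
    # List the jar's entries and keep only those that look like JSON SerDes
    jar tf "$jarfile" 2>/dev/null | grep -i 'json.*serde' | while IFS= read -r entry; do
      # Turn path/to/Class.class into a dotted class name
      class=$(printf '%s\n' "$entry" | sed -e 's|\.class$||' -e 's|/|.|g')
      printf '%s\t%s\n' "$jarfile" "$class"
    done
  done
}
```

Inside the container, running find_json_serdes /usr/lib would print one line per match, with the jar path and the class name separated by a tab. The class names it prints are the candidates for the ROW FORMAT SERDE clause.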
(correcting my previous comment after looking deeper at the documentation myself)
I think org.apache.hive.hcatalog.data.JsonSerDe moved to org.apache.hadoop.hive.serde2.JsonSerDe in Hive 3 (EMR 6.x), so give that a shot. I flipped the class names in my original comment.
I'll try to give this a try on my end as well.
@dacort Thanks for the update on the naming. I tried the org.apache.hadoop.hive.serde2.JsonSerDe value and still received the same result (java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found), so I think there may be a version mismatch in one of the EMR Serverless libraries being used to run these jobs.
I'm not using a cluster or EKS (trying to go all serverless), so I'm unable to verify versions. Waiting to hear what you find on your end as well.
@dacort Let me know if you are able to verify on your side as well. If so, let me know where to log the issue for this, as using EMR Serverless with JSON data in an S3 bucket seems like a standard use case that should be addressed.
@jrolstad Just gave it a shot using your scripts linked above and it worked fine for me.
One thing I noticed is that you're using org.apache.hadoop.hive.serde2.JsonSerDe in the user_createtables.sql script, but your error message says that org.apache.hive.hcatalog.data.JsonSerDe is the class that's not found. If you ran the createtables script previously with the latter serde, you'll need to drop that table before running the script again. Hive on EMR Serverless uses the Glue Data Catalog, so you can either delete it in the Glue Console or add a DROP TABLE statement. This confused me as well, so I should make it more explicit in the README here.
As an aside, you can run the EMR on EKS container image locally without having to use EKS. It's handy for when you want to have a local EMR environment, but is primarily geared towards Spark.
Dropping the table and recreating worked! Thanks for the help.
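For anyone landing here later, here is a minimal sketch of that drop-and-recreate step. It only writes a HiveQL script to disk. The table name, columns, and S3 location are placeholders (the real schema lives in the scripts linked from the issue), and the file name recreate_table.sql is made up for illustration.

```shell
# Hypothetical names: adjust the table, columns, and S3 location to your data.
TABLE_NAME="users"
DATA_LOCATION="s3://your-bucket/path/to/json/"

cat > recreate_table.sql <<EOF
-- Remove the stale Glue Data Catalog entry that still points at the old SerDe
DROP TABLE IF EXISTS ${TABLE_NAME};

-- Recreate the table with the SerDe class this thread settled on
CREATE EXTERNAL TABLE ${TABLE_NAME} (
  id STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
LOCATION '${DATA_LOCATION}';
EOF
```

Because the table is external, dropping it only removes the catalog entry, so the JSON files in S3 are left alone. The generated file can then be used in place of the original create-tables script.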
Sweet, thanks for following up! I'll add a note to the Hive section re: that specific Serde.
I am using the Hive example as a template, but with JSON data instead. Using the setup below, I receive an error every time. Is there a different setup I should be using? Also, is there a good example for Hive on EMR Serverless using JSON data that should work?
Having an additional Hive example based on JSON in this repository would be helpful, since the existing one uses CSV-formatted data.
Details
Given an S3 bucket that contains files with a format such as
and the initialization script here and the query script here
When I run a job in EMR Serverless using these inputs
Then I receive an error message stating