Closed nadersalehi closed 7 years ago
I think that the most informative part of this stacktrace is the following exception:
Caused by: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
Based on this, I suspect that your EMR environment is using a different Avro version than whatever version is used when you run locally. I'm not super-familiar with Spark in EMR; do you know which version of Spark is being used there?
Also, just to clarify: which version of spark-avro are you using?
EMR 4.0 uses spark 1.4.1 on Hadoop 2.6 YARN. I use spark-avro version 2.0.1-s_2.10.
A few things to mention:
spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 time_series_producer.py -i s3://my-input -o s3://my-output --s3-access --s3-secret
Hey @nadersalehi
We use spark-avro on EMR 4.0 quite a lot and also ran into this problem initially. It's just the usual dependency hell. Ironically, though the fact that Avro is used quite a bit throughout the Hadoop ecosystem seemed like an advantage at first, it's turned out to be one of the main causes of difficulties, as it means there are often many different versions polluting the classpath in different deployment environments.
We found that EMR 4.0 does in fact provide Avro 1.7.4, but it should still be possible to make spark-avro work without hacking the EMR environment. We embed spark-avro in our jar as a bundled dependency and it seems to take priority over the jar on the EMR environment. The cause of this problem for us was actually that transitive dependencies from other libraries we use were also pulling in conflicting Avro versions. Try using whatever build tool you have to visualise your dependency tree and find any other Avro versions being pulled in first.
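To make the dependency-tree check concrete: with Maven, `mvn dependency:tree -Dincludes=org.apache.avro` prints every path that pulls Avro in, and sbt's dependency-graph plugin offers an equivalent. Once you know the runtime classpath, conflicting versions are also easy to spot directly; a minimal sketch using made-up paths:

```shell
# Illustrative classpath containing two conflicting Avro versions (paths are invented):
classpath="/usr/lib/hadoop/lib/avro-1.7.4.jar:/home/hadoop/lib/avro-1.7.7.jar:/usr/lib/spark/lib/spark-assembly.jar"
# Split on ':' and list every standalone Avro artifact present:
echo "$classpath" | tr ':' '\n' | grep 'avro-'
```

If more than one version shows up, excluding the transitive copies in your build tool is usually the fix.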
In any case, I don't think there's anything that can be done on the spark-avro side to improve this. This library necessarily needs to pull in Avro as a dependency and can't reasonably assume it'll be provided.
Hope that helps
@jaley
Thanks for the clarification. It makes a lot of sense. I can verify that EMR 4.0 is using Avro 1.7.4 (1.7.5 in the case of Pig). Our application uses pyspark, so I am not sure how I could embed the correct version of spark-avro as a dependency. Any help would be appreciated.
@JoshRosen & @jaley
I poked around a bit further and noticed that the problem stems from spark-avro not being present on the executor nodes. As a workaround, I copied com.databricks_spark-avro_2.11-2.0.1.jar and org.apache.avro_avro-1.7.x.jar into /home/hadoop/.ivy2/jar, added that directory to SPARK_CLASSPATH, and ran the application to completion.
I was under the impression that the '--packages' option takes care of distributing a given package, and its dependencies, to the driver and executor nodes, but it only happens on the driver. While this is clearly not an Avro-specific problem, do you guys have any suggestions on how to address my problem before I close this issue?
Thanks, Nader
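If --packages really does resolve only on the driver in this setup, one hedged workaround (untested here; jar names and paths are taken from this thread, the download location is an assumption) is to ship the jars explicitly with --jars, which YARN localizes into each executor's container:

```shell
spark-submit \
  --jars /home/hadoop/jars/spark-avro_2.10-2.0.1.jar,/home/hadoop/jars/avro-1.7.7.jar \
  --conf spark.executor.extraClassPath=spark-avro_2.10-2.0.1.jar:avro-1.7.7.jar \
  time_series_producer.py -i s3://my-input -o s3://my-output
```

The bare jar names in extraClassPath rely on YARN placing localized jars in the container working directory; that behavior is an assumption worth verifying on your cluster.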
@brkyvz, do you have any insights into what might be happening here with --packages?
@nadersalehi Just to double-check: in your last message, you mentioned that you copied com.databricks_spark-avro_2.11-2.0.1 to the executors, but in your --packages command you were using com.databricks:spark-avro_2.10:2.0.1. I wonder if this is a simple Scala-version binary-incompatibility problem.
@brkyvz
Sorry, that was a typo; I actually copied com.databricks:spark-avro_2.10:2.0.1. The main issue here, IMHO, is that I don't see any spark-avro package on the executor nodes unless I manually copy them.
@JoshRosen @nadersalehi Then I think it may be an issue related to YARN being used with --packages. I'll look into it.
@nadersalehi Having the same issue, using spark-shell on EMR 4.1.0: Avro 1.7.5 is needed, but EMR has 1.7.4. I'm trying the following:
spark-shell --jars avro-1.7.6.jar,spark-avro_2.11-2.0.1.jar,RedshiftJDBC41-1.1.7.1007.jar --packages org.apache.avro:avro:1.7.6,com.databricks:spark-avro_2.11:2.0.1,com.databricks:spark-redshift_2.10:0.5.1 --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true
But yeah, it's not picking them up... did you have to put those in /home/hadoop/.ivy2/jar/ on all nodes?
Got it to work by simply deleting all Avro jars off the cluster. Spark picks up /usr/lib/spark/lib/spark-assembly.jar on the classpath, which contains both the avro and avro-mapred libs at version 1.7.7; that's the only jar I don't delete.
# delete jars from master node
find / -name "*avro*jar" -print0 2> /dev/null | xargs -0 -I file sudo rm file
# delete jars from slave nodes
yarn node -list | sed 's/ .*//g' | tail -n +3 | sed 's/:.*//g' | xargs -I node ssh node 'find / -name "*avro*jar" -print0 2> /dev/null | xargs -0 -I file sudo rm file'
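Before running that deletion for real, a dry run helps confirm what would be removed. A sketch against a mock directory tree (a real run would search / instead of the mock root):

```shell
# Build a mock filesystem with one standalone Avro jar plus the Spark assembly:
root=$(mktemp -d)
mkdir -p "$root/usr/lib/hadoop/lib" "$root/usr/lib/spark/lib"
touch "$root/usr/lib/hadoop/lib/avro-1.7.4.jar" \
      "$root/usr/lib/spark/lib/spark-assembly.jar"
# spark-assembly.jar bundles Avro 1.7.7 inside itself, so "*avro*jar" never matches it:
find "$root" -name "*avro*jar" -print
```

Only the standalone avro-1.7.4.jar is listed; the assembly jar survives, which is exactly the behavior the deletion above relies on.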
I confirm this fixes the problem. Thanks Alex. I tried just the first part (on master node), got frustrated and quit.
EMR 4.1.0 still ships Avro 1.7.4 (IMO 1.7.5 is needed). I filed a ticket with Amazon; hopefully they'll include it in the next EMR version.
Hi, Jonathan from EMR here. Part of the problem is that Hadoop depends upon Avro 1.7.4, and the full Hadoop classpath is included in the Spark path. Do you think it might help for us to upgrade Hadoop to Avro 1.7.7 to match with Spark's dependency? Is Avro backward-compatible such that we can even expect that this shouldn't break anything in Hadoop?
Unfortunately, as spark-avro is also used in spark-redshift to write into the database, it seems that the whole thing is broken on EMR 4.* :(
Any workaround for this issue? Deleting all the Avro jars off the cluster works, but I couldn't find a way to automate it (bootstrap actions happen too early in the process).
Yeah, writing doesn't work at all on EMR 4.*. Jonathan suggested setting spark.{driver,executor}.userClassPathFirst via the spark-defaults classification of the EMR cluster config. He wasn't sure it'd work, and I wasn't successful trying it out, but I didn't try too hard. If you make it work, let me know.
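For reference, the spark-defaults classification Jonathan mentioned would look something like this at cluster creation time (a sketch only; as noted above, neither of us has confirmed it actually resolves the Avro conflict):

```shell
aws emr create-cluster \
  --release-label emr-4.2.0 \
  ... \
  --configurations '[{
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.userClassPathFirst": "true",
      "spark.executor.userClassPathFirst": "true"
    }
  }]'
```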
We found a bit of a hacky workaround: we created a bootstrap action to copy the avro-1.7.7 jar into /usr/lib/hadoop-mapreduce; then when Spark runs, this seems to take precedence over Hadoop's Avro jar. I'm not completely sure why this works, but in our (limited) testing it seems to be consistent, maybe because the new jar is written first, so it comes up first in the directory listing.
It seems to be working indeed! If anyone needs a quick solution, I uploaded the following script to S3 and successfully use it as a bootstrap action.
#!/bin/bash
# Get Avro jar and copy it into /usr/lib/hadoop-mapreduce
cd /tmp/
wget http://ftp.cuhk.edu.hk/pub/packages/apache.org/avro/avro-1.7.7/java/avro-1.7.7.jar
sudo mkdir -p /usr/lib/hadoop-mapreduce
sudo cp avro-1.7.7.jar /usr/lib/hadoop-mapreduce
You might want to change the mirror to something closer to your instance.
Just a heads up that the above hack stopped working for us on emr-4.2.0: it picks up the old Avro jar ahead of the new one on the classpath. But it's easy enough to work around: just put the new jar in a different directory, and include that directory at the beginning of spark.{driver,executor}.extraClassPath.
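Putting the two pieces of that workaround together, a bootstrap-action sketch (the directory name /usr/lib/avro-override is an assumption, and the jar is assumed to have been downloaded to /tmp as in the earlier bootstrap script):

```shell
#!/bin/bash
# Copy the downloaded Avro 1.7.7 jar into a dedicated directory (assumed name):
sudo mkdir -p /usr/lib/avro-override
sudo cp /tmp/avro-1.7.7.jar /usr/lib/avro-override/
# Then prepend that directory in the spark-defaults classification so it wins:
#   spark.driver.extraClassPath   /usr/lib/avro-override/*:<existing entries>
#   spark.executor.extraClassPath /usr/lib/avro-override/*:<existing entries>
```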
@emlyn's last solution worked for us for a while... until recently, when this error popped up again despite us using avro-1.7.7.jar (I checked the classpath as mentioned here: https://github.com/databricks/spark-redshift/issues/129#issuecomment-160740317), so this really doesn't make much sense to me. We are on EMR 4.3.
I'm new to AWS and I'm kind of confused by the solutions posted here. Can someone list the steps needed to load/bootstrap spark-avro on my EMR cluster so that I can use it in a Zeppelin notebook? (Since I am in an enterprise environment, I do not have access to the web to set it up on the cluster directly.)
I think that this issue is somewhat out of scope / not actionable by us right now. If someone has a set of instructions which work and are easy to follow, then please submit a PR to add them to the README. I'm going to close this issue for now in order to declutter the issues backlog, but please re-open if new issues of this type start occurring.
I just faced the same problem:
Caused by: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
I'm using the following versions: EMR 5.13.0, Spark 2.3.0.
In our Scala code we use this dependency
lazy val sparkAvroDependency = libraryDependencies +=
"com.databricks" %% "spark-avro" % "4.0.0"
I resolved the problem by replacing the Avro jar version in /etc/spark/conf/spark-defaults.conf:
spark.driver.extraClassPath /usr/lib/hadoop/lib/avro-1.7.4.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar
spark.executor.extraClassPath /usr/lib/hadoop/lib/avro-1.7.4.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar
with
spark.driver.extraClassPath /usr/lib/spark/jars/avro-1.7.7.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar
spark.executor.extraClassPath /usr/lib/spark/jars/avro-1.7.7.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar
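That edit can also be scripted; a sketch against a mock copy of the file (on EMR the real path is /etc/spark/conf/spark-defaults.conf and editing it needs sudo, which is why the mock uses a temp file):

```shell
# Mock spark-defaults.conf containing the Hadoop-provided Avro 1.7.4 entry:
conf=$(mktemp)
echo "spark.driver.extraClassPath /usr/lib/hadoop/lib/avro-1.7.4.jar:/usr/lib/hadoop-lzo/lib/*" > "$conf"
# Point the classpath at Spark's bundled Avro 1.7.7 jar instead:
sed -i 's#/usr/lib/hadoop/lib/avro-1.7.4.jar#/usr/lib/spark/jars/avro-1.7.7.jar#' "$conf"
grep 'avro' "$conf"
```

A bootstrap action running the same sed on the real file on every node would make the fix automatic, though that remains untested here.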
I have a simple piece of code which works without a problem when running on my laptop (using pre-built Spark 1.4.1 for Hadoop 2.4). When running the same code on EMR 4.0, it crashes when it tries to execute the following line.
The stack trace is provided below: