@karthik892 Can you try Sedona 1.4.1? Does the same problem exist?
I will give it a try with 1.4.1 ... but in the meantime, I have managed to get this working with Sedona 1.1 ... sort of:
I copied the following files into the `$SPARK_HOME/jars` directory:

```
geotools-wrapper-1.5.0-28.2.jar
apache-sedona-1.5.0-bin/sedona-spark-shaded-3.0_2.12-1.5.0.jar
```
I have added the following to my Python code:
```python
.config('spark.jars.packages',
        'org.apache.sedona:sedona-python-adapter-3.0_2.12:1.1.0-incubating,' +
        'org.datasyslab:geotools-wrapper:1.5.0-28.2'
)
```
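For reference, here is a minimal, self-contained sketch of that setup, assuming the 1.1.0-incubating Python API (`SedonaRegistrator`; in 1.5.x the `sedona.spark.SedonaContext` entry point replaces it). The Kryo settings follow the Sedona docs; everything else is pulled via `spark.jars.packages` alone:

```python
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator

# Dependencies are resolved via spark.jars.packages only; nothing is
# copied into $SPARK_HOME/jars.
spark = (
    SparkSession.builder
    .appName("sedona-example")
    .config(
        "spark.jars.packages",
        "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.1.0-incubating,"
        "org.datasyslab:geotools-wrapper:1.5.0-28.2",
    )
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator")
    .getOrCreate()
)

# Registers the ST_* SQL functions (ST_Point, ST_GeomFromWKT, ST_Intersects, ...)
SedonaRegistrator.registerAll(spark)
```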
geotools-wrapper bundles an old version of commons-text, as well as some other Apache Commons libraries, which may cause JAR conflicts like this one. I'm submitting a PR to remove the Apache Commons libraries from geotools-wrapper, so that this problem goes away.
Another thing to mention: specifying the dependency packages in `spark.jars.packages` is enough to set up Apache Sedona. There is no need to copy the JARs into the `$SPARK_HOME/jars` directory.
I'm encountering this error in a Spark Scala project, which is odd because the Gradle dependency report shows that the commons-text version is 1.10.0.
Here are the full logs.
```
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.commons.text.StringSubstitutor.setEnableUndefinedVariableException(Z)Lorg/apache/commons/text/StringSubstitutor;
	at org.apache.spark.ErrorClassesJsonReader.getErrorMessage(ErrorClassesJSONReader.scala:49)
	at org.apache.spark.SparkThrowableHelper$.getMessage(SparkThrowableHelper.scala:55)
	at org.apache.spark.SparkThrowableHelper$.getMessage(SparkThrowableHelper.scala:42)
	at org.apache.spark.sql.AnalysisException.<init>(AnalysisException.scala:80)
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
	at org.apache.spark.sql.Dataset.write(Dataset.scala:3833)
```
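A `NoSuchMethodError` like this usually means the JVM resolved `org.apache.commons.text.StringSubstitutor` from an older copy of the class bundled inside some other JAR, not from commons-text 1.10.0. One way to check which JAR actually wins at runtime, sketched here from the PySpark side via plain py4j reflection (nothing Sedona-specific is assumed):

```python
# Ask the JVM where org.apache.commons.text.StringSubstitutor was loaded from.
clazz = spark._jvm.java.lang.Class.forName("org.apache.commons.text.StringSubstitutor")
location = clazz.getProtectionDomain().getCodeSource().getLocation()
print(location)  # e.g. a shaded fat jar instead of commons-text-1.10.0.jar

# List the StringSubstitutor methods to see whether the newer API is present.
for m in clazz.getDeclaredMethods():
    if "UndefinedVariable" in m.getName():
        print(m)
```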
And here is my build.gradle:

```groovy
plugins {
    id 'java'
}

ext {
    scalaVersion = "2.12"
    sparkVersion = "3.4.1"
}

group 'com.vdtdata'
version '1.0-SNAPSHOT'

apply plugin: 'java'
apply plugin: 'scala'

repositories {
    jcenter()
    mavenLocal()
    mavenCentral()
    maven {
        url "https://oss.sonatype.org/content/repositories/snapshots"
    }
    maven {
        url "https://repository.cloudera.com/artifactory/cloudera-repos/"
    }
}

dependencies {
    implementation "org.scala-lang:scala-library:${project.ext.scalaVersion}.10"
    implementation "org.scala-lang:scala-reflect:${project.ext.scalaVersion}.10"
    implementation "org.scala-lang:scala-compiler:${project.ext.scalaVersion}.10"
    implementation "org.apache.hudi:hudi-spark3.4-bundle_${project.ext.scalaVersion}:0.14.0"
    implementation "org.apache.hadoop:hadoop-aws:3.3.4"
    implementation("org.apache.sedona:sedona-spark-shaded-3.4_${project.ext.scalaVersion}:1.5.0")
    implementation("org.datasyslab:geotools-wrapper:1.5.0-28.2")
    implementation("org.postgresql:postgresql:42.7.1")
    implementation "org.apache.spark:spark-mllib_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-sql_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-graphx_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-launcher_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-catalyst_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-streaming_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-core_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-hive_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation 'org.apache.commons:commons-text:1.10.0' // also tried removing this line, to no avail
    implementation group: 'commons-io', name: 'commons-io', version: '2.11'
}
```
```
$ gradle -q dependencyInsight --dependency org.apache.commons:commons-text
org.apache.commons:commons-text:1.10.0
  Variant compile:
    | Attribute Name                 | Provided | Requested    |
    |--------------------------------|----------|--------------|
    | org.gradle.status              | release  |              |
    | org.gradle.category            | library  | library      |
    | org.gradle.libraryelements    | jar      | classes      |
    | org.gradle.usage               | java-api | java-api     |
    | org.gradle.dependency.bundling |          | external     |
    | org.gradle.jvm.environment     |          | standard-jvm |
    | org.gradle.jvm.version         |          | 8            |

org.apache.commons:commons-text:1.10.0
+--- compileClasspath
\--- org.apache.spark:spark-core_2.12:3.4.1
     +--- compileClasspath
     +--- org.apache.spark:spark-mllib_2.12:3.4.1
     |    \--- compileClasspath
     +--- org.apache.spark:spark-streaming_2.12:3.4.1
     |    +--- compileClasspath
     |    \--- org.apache.spark:spark-mllib_2.12:3.4.1 (*)
     +--- org.apache.spark:spark-hive_2.12:3.4.1
     |    \--- compileClasspath
     +--- org.apache.spark:spark-sql_2.12:3.4.1
     |    +--- compileClasspath
     |    +--- org.apache.spark:spark-mllib_2.12:3.4.1 (*)
     |    \--- org.apache.spark:spark-hive_2.12:3.4.1 (*)
     +--- org.apache.spark:spark-graphx_2.12:3.4.1
     |    +--- compileClasspath
     |    \--- org.apache.spark:spark-mllib_2.12:3.4.1 (*)
     \--- org.apache.spark:spark-catalyst_2.12:3.4.1
          +--- compileClasspath
          \--- org.apache.spark:spark-sql_2.12:3.4.1 (*)

(*) - Indicates repeated occurrences of a transitive dependency subtree. Gradle expands transitive dependency subtrees only once per project; repeat occurrences only display the root of the subtree, followed by this annotation.

A web-based, searchable dependency report is available by adding the --scan option.
```
@zachtrong This might be caused by the shaded jar of Sedona, which shades the commons libraries. Since you are using Gradle, which can handle Maven coordinates, please use the unshaded version of Sedona, `sedona-spark` (https://sedona.apache.org/1.5.0/setup/maven-coordinates/#use-sedona-unshaded-jars). If the problem persists, also remove `geotools-wrapper` from the dependencies, as it might shade the commons libraries too. If you then run into any issue about missing geotools classes, add the geotools dependencies manually: https://github.com/jiayuasu/geotools-wrapper/blob/main/pom.xml#L74
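For Gradle, that advice translates roughly into the following sketch. The unshaded `sedona-spark-3.4_2.12` coordinate is the one from the linked Sedona docs; the particular geotools modules and the OSGeo repository URL are illustrative assumptions, so check the linked geotools-wrapper pom for the exact set it bundles:

```groovy
repositories {
    mavenCentral()
    // geotools artifacts are not on Maven Central (assumption: OSGeo release repo)
    maven { url "https://repo.osgeo.org/repository/release/" }
}

dependencies {
    // Unshaded Sedona: commons libraries are resolved as normal Maven
    // dependencies instead of being bundled inside the jar.
    implementation "org.apache.sedona:sedona-spark-3.4_2.12:1.5.0"

    // Replaces geotools-wrapper; this module list is an illustrative subset,
    // see the geotools-wrapper pom for everything it provides.
    implementation "org.geotools:gt-main:28.2"
    implementation "org.geotools:gt-referencing:28.2"
    implementation "org.geotools:gt-epsg-hsqldb:28.2"
}
```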
I ended up building the latest geotools-wrapper jar, version 1.5.1, which fixes the dependency issues.
@zachtrong Cool, this could be a solution too. I also updated our Scala template project to show the correct dependency settings when users do it via Maven, SBT, or Gradle: https://github.com/apache/sedona/blob/master/examples/spark-sql/pom.xml#L61
I have created a Python application that does the following (a sketch follows the list):
1. `.withColumn("geo_point", F.expr("ST_POINT(...)"))` and `createOrReplaceTempView("A")`
2. `.withColumn("geometry", F.expr("ST_GeomFromWKT(wkt_geometry)"))` and `createOrReplaceTempView("B")`
3. Joins the two views using `ST_Intersects`
4. Displays the results
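A minimal runnable sketch of those four steps, assuming `spark` is a session with Sedona's SQL functions registered; the input DataFrames and the column names other than `geo_point`, `geometry`, and `wkt_geometry` are hypothetical:

```python
from pyspark.sql import functions as F

# Hypothetical inputs; in the real application these come from files.
df_points = spark.createDataFrame([(1, 2.0, 48.0)], ["id", "lon", "lat"])
df_shapes = spark.createDataFrame(
    [(10, "POLYGON((0 40, 10 40, 10 50, 0 50, 0 40))")], ["sid", "wkt_geometry"]
)

# Steps 1 and 2: build geometries and register the temp views.
df_points.withColumn(
    "geo_point", F.expr("ST_Point(lon, lat)")
).createOrReplaceTempView("A")

df_shapes.withColumn(
    "geometry", F.expr("ST_GeomFromWKT(wkt_geometry)")
).createOrReplaceTempView("B")

# Steps 3 and 4: spatial join on intersection, then display the results.
result = spark.sql("""
    SELECT A.*, B.*
    FROM A JOIN B
      ON ST_Intersects(A.geo_point, B.geometry)
""")
result.show()
```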
Expected behavior
I expect to see the result of my spatial join, an empty result set, or a SQL error.
Actual behavior
I receive the `NoSuchMethodError` shown above, usually at step 4. (Is Sedona automatically detecting the WKT string?)
Steps to reproduce the problem
See steps listed above
Settings
Sedona version = 1.5.0
Apache Spark version = 3.4.1 (also tried 3.5.0)
Apache Flink version = Not using Flink
API type = Python
Scala version = 2.12
JRE version = 11.0.1 (OpenJDK)
Python version = 3.8.10
Environment = Standalone ... sort of. I am running a virtual Hadoop cluster consisting of 5 VirtualBox VMs: 1 NameNode and 4 DataNodes. I use Vagrant to make creating and destroying these VMs easier. The VMs run Ubuntu 20.04 and Hadoop 3.3.6.
I also have an SO question trying to find an answer to this problem - https://stackoverflow.com/questions/77326231/receiving-nosuchmethoderror-when-running-sql-query-in-a-pyspark-application-wi