apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

Receiving NoSuchMethodError: org.apache.commons.text.StringSubstitutor.setEnableUndefinedVariableException #1055

Closed: karthikappy closed this issue 12 months ago

karthikappy commented 1 year ago

I have created a Python application that does the following (a rough sketch follows the list):

  1. Read a CSV file containing a series of lat/long values
  2. Create a point geometry column from the lat/long using .withColumn("geo_point", F.expr("ST_POINT(...)"))
  3. Save table A using createOrReplaceTempView("A")
  4. Read in another CSV file which contains a series of polygons defined in WKT format
  5. Create a polygon geometry column from the wkt using .withColumn("geometry", F.expr("ST_GeomFromWKT(wkt_geometry)"))
  6. Save table B using createOrReplaceTempView("B")
  7. Run a query using ST_Intersects and display the results
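
A rough sketch of this workflow is below. The file paths and the lat/lon column names are assumptions (wkt_geometry is the column named in step 5); SedonaContext is the setup documented for Sedona 1.5.0.

# Sketch of the steps above. File paths and the lat/lon column names are
# assumptions, not taken from the actual application.
from sedona.spark import SedonaContext
from pyspark.sql import functions as F

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Steps 1-3: build point geometries from lat/long and register table A
points = sedona.read.option("header", True).csv("points.csv") \
    .withColumn("geo_point", F.expr("ST_Point(CAST(lon AS DOUBLE), CAST(lat AS DOUBLE))"))
points.createOrReplaceTempView("A")

# Steps 4-6: build polygon geometries from WKT and register table B
polygons = sedona.read.option("header", True).csv("polygons.csv") \
    .withColumn("geometry", F.expr("ST_GeomFromWKT(wkt_geometry)"))
polygons.createOrReplaceTempView("B")

# Step 7: spatial join with ST_Intersects
sedona.sql("SELECT A.*, B.* FROM A JOIN B ON ST_Intersects(B.geometry, A.geo_point)").show()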

Expected behavior

I expect to see the result of my spatial join, an empty result set, or an SQL error.

Actual behavior

I receive the following error, usually at step 4 (Is Sedona automatically detecting the WKT string?):

py4j.protocol.Py4JJavaError: An error occurred while calling o34.sql.
: java.lang.NoSuchMethodError: org.apache.commons.text.StringSubstitutor.setEnableUndefinedVariableException(Z)Lorg/apache/commons/text/StringSubstitutor;
        at org.apache.spark.ErrorClassesJsonReader.getErrorMessage(ErrorClassesJSONReader.scala:49)
...

Steps to reproduce the problem

See steps listed above

Settings

Sedona version = 1.5.0

Apache Spark version = 3.4.1 (also tried 3.5.0)

Apache Flink version = Not using Flink

API type = Python

Scala version = 2.12

JRE version = 11.0.1 (OpenJDK)

Python version = 3.8.10

Environment = Standalone ... sort of ... I am running a virtual Hadoop cluster consisting of 5 VirtualBox VMs (1 NameNode and 4 DataNodes). I am using Vagrant to make creating and destroying these VMs easier. The VMs run Ubuntu 20.04 and Hadoop 3.3.6.

I also have an SO question trying to find an answer to this problem - https://stackoverflow.com/questions/77326231/receiving-nosuchmethoderror-when-running-sql-query-in-a-pyspark-application-wi

jiayuasu commented 1 year ago

@karthikappy Can you try Sedona 1.4.1? Does the same problem exist?

karthikappy commented 1 year ago

I will give it a try with 1.4.1 ... but in the meantime, I have managed to get this working with Sedona 1.1 ... sort of:

I copied the following files into the $SPARK_HOME/jars directory:

geotools-wrapper-1.5.0-28.2.jar 
apache-sedona-1.5.0-bin/sedona-spark-shaded-3.0_2.12-1.5.0.jar

I have also added the following to my Python code:

.config('spark.jars.packages',
        'org.apache.sedona:sedona-python-adapter-3.0_2.12:1.1.0-incubating,' +
        'org.datasyslab:geotools-wrapper:1.5.0-28.2')

Kontinuation commented 1 year ago

geotools-wrapper bundles an old version of commons-text, as well as some other Apache Commons libraries, which can cause JAR conflicts like this one: the bundled copy lacks StringSubstitutor.setEnableUndefinedVariableException, so when it shadows Spark's newer commons-text on the classpath, Spark's error reporting fails with the NoSuchMethodError above. I'm submitting a PR to remove the Apache Commons libraries from geotools-wrapper so that this problem goes away.
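
One way to confirm which JAR is winning such a conflict (a sketch, assuming an active PySpark session named spark) is to ask the JVM where the class was loaded from:

# Sketch: ask the JVM which JAR actually provides StringSubstitutor.
# Assumes an active SparkSession named `spark`.
klass = spark._jvm.java.lang.Class.forName("org.apache.commons.text.StringSubstitutor")
print(klass.getProtectionDomain().getCodeSource().getLocation().toString())
# If this prints the geotools-wrapper JAR rather than commons-text-1.10.0.jar,
# the conflict described above is confirmed.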

Another thing to mention: specifying the dependency packages in spark.jars.packages is enough to set up Apache Sedona. There is no need to copy the JARs into the $SPARK_HOME/jars directory.
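
For example, a minimal session using only spark.jars.packages might look like the sketch below (versions are the ones discussed in this thread, including the geotools-wrapper JAR with the conflict noted above; the Kryo settings follow the Sedona docs):

from pyspark.sql import SparkSession

# Sketch: let Spark resolve the Sedona JARs itself instead of copying them
# into $SPARK_HOME/jars. Versions are the ones discussed in this thread.
spark = SparkSession.builder \
    .appName("sedona-app") \
    .config("spark.jars.packages",
            "org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.0,"
            "org.datasyslab:geotools-wrapper:1.5.0-28.2") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator") \
    .getOrCreate()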

zachtrong commented 11 months ago

I'm encountering this error in a Spark Scala project, which is strange because the Gradle dependency report shows that the commons-text version is 1.10.0.

Here are the full logs.

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.commons.text.StringSubstitutor.setEnableUndefinedVariableException(Z)Lorg/apache/commons/text/StringSubstitutor;
    at org.apache.spark.ErrorClassesJsonReader.getErrorMessage(ErrorClassesJSONReader.scala:49)
    at org.apache.spark.SparkThrowableHelper$.getMessage(SparkThrowableHelper.scala:55)
    at org.apache.spark.SparkThrowableHelper$.getMessage(SparkThrowableHelper.scala:42)
    at org.apache.spark.sql.AnalysisException.<init>(AnalysisException.scala:80)
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
    at org.apache.spark.sql.Dataset.write(Dataset.scala:3833)

My build.gradle:

plugins {
    id 'java'
}

ext {
    scalaVersion = "2.12"
    sparkVersion = "3.4.1"
}

group 'com.vdtdata'
version '1.0-SNAPSHOT'

apply plugin: 'java'
apply plugin: 'scala'

repositories {
    jcenter()
    mavenLocal()
    mavenCentral()
    maven {
        url "https://oss.sonatype.org/content/repositories/snapshots"
    }
    maven {
        url "https://repository.cloudera.com/artifactory/cloudera-repos/"
    }
}

dependencies {
    implementation "org.scala-lang:scala-library:${project.ext.scalaVersion}.10"
    implementation "org.scala-lang:scala-reflect:${project.ext.scalaVersion}.10"
    implementation "org.scala-lang:scala-compiler:${project.ext.scalaVersion}.10"
    implementation "org.apache.hudi:hudi-spark3.4-bundle_${project.ext.scalaVersion}:0.14.0"
    implementation "org.apache.hadoop:hadoop-aws:3.3.4"
    implementation("org.apache.sedona:sedona-spark-shaded-3.4_${project.ext.scalaVersion}:1.5.0")
    implementation("org.datasyslab:geotools-wrapper:1.5.0-28.2")
    implementation("org.postgresql:postgresql:42.7.1")

    implementation "org.apache.spark:spark-mllib_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-sql_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-graphx_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-launcher_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-catalyst_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-streaming_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-core_${project.ext.scalaVersion}:${project.ext.sparkVersion}"
    implementation "org.apache.spark:spark-hive_${project.ext.scalaVersion}:${project.ext.sparkVersion}"

    implementation 'org.apache.commons:commons-text:1.10.0' // also tried removing this line, to no avail
    implementation group: 'commons-io', name: 'commons-io', version: '2.11'
}

The dependency insight report:

gradle -q dependencyInsight --dependency org.apache.commons:commons-text

org.apache.commons:commons-text:1.10.0
  Variant compile:
    | Attribute Name                 | Provided | Requested    |
    |--------------------------------|----------|--------------|
    | org.gradle.status              | release  |              |
    | org.gradle.category            | library  | library      |
    | org.gradle.libraryelements     | jar      | classes      |
    | org.gradle.usage               | java-api | java-api     |
    | org.gradle.dependency.bundling |          | external     |
    | org.gradle.jvm.environment     |          | standard-jvm |
    | org.gradle.jvm.version         |          | 8            |

org.apache.commons:commons-text:1.10.0
+--- compileClasspath
\--- org.apache.spark:spark-core_2.12:3.4.1
     +--- compileClasspath
     +--- org.apache.spark:spark-mllib_2.12:3.4.1
     |    \--- compileClasspath
     +--- org.apache.spark:spark-streaming_2.12:3.4.1
     |    +--- compileClasspath
     |    \--- org.apache.spark:spark-mllib_2.12:3.4.1 (*)
     +--- org.apache.spark:spark-hive_2.12:3.4.1
     |    \--- compileClasspath
     +--- org.apache.spark:spark-sql_2.12:3.4.1
     |    +--- compileClasspath
     |    +--- org.apache.spark:spark-mllib_2.12:3.4.1 (*)
     |    \--- org.apache.spark:spark-hive_2.12:3.4.1 (*)
     +--- org.apache.spark:spark-graphx_2.12:3.4.1
     |    +--- compileClasspath
     |    \--- org.apache.spark:spark-mllib_2.12:3.4.1 (*)
     \--- org.apache.spark:spark-catalyst_2.12:3.4.1
          +--- compileClasspath
          \--- org.apache.spark:spark-sql_2.12:3.4.1 (*)

(*) - Indicates repeated occurrences of a transitive dependency subtree. Gradle expands transitive dependency subtrees only once per project; repeat occurrences only display the root of the subtree, followed by this annotation.

A web-based, searchable dependency report is available by adding the --scan option.
jiayuasu commented 11 months ago

@zachtrong This might be caused by the shaded jar of Sedona, which shades the Commons libraries. Since you are using Gradle, which can handle Maven coordinates, please use the unshaded version of Sedona, sedona-spark (https://sedona.apache.org/1.5.0/setup/maven-coordinates/#use-sedona-unshaded-jars).

If the problem persists, also remove geotools-wrapper from the dependencies, as it might shade Commons as well. If you then run into issues with missing GeoTools classes, add the GeoTools dependencies manually (a sketch follows): https://github.com/jiayuasu/geotools-wrapper/blob/main/pom.xml#L74
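
A sketch of what that change could look like in the build.gradle above. The unshaded coordinate comes from the Sedona docs; the specific GeoTools artifacts and the osgeo repository are assumptions based on the geotools-wrapper POM linked above:

// Sketch: swap the shaded Sedona artifact for the unshaded one and drop
// geotools-wrapper. Coordinates assume Spark 3.4 / Scala 2.12 as above.
repositories {
    // GeoTools artifacts are hosted on the osgeo repository, not Maven Central.
    maven { url "https://repo.osgeo.org/repository/release/" }
}

dependencies {
    implementation "org.apache.sedona:sedona-spark-3.4_${project.ext.scalaVersion}:1.5.0"
    // Add GeoTools only if classes turn out to be missing at runtime; the full
    // list is in the geotools-wrapper POM linked above.
    implementation "org.geotools:gt-main:28.2"
    implementation "org.geotools:gt-referencing:28.2"
    implementation "org.geotools:gt-epsg-hsql:28.2"
}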

zachtrong commented 11 months ago

I ended up building the latest geotools-wrapper JAR, version 1.5.1, which fixed the dependency issues.

jiayuasu commented 10 months ago

@zachtrong Cool. This could be a solution too. I have also updated our Scala template project to show the correct dependency settings for users who set this up via Maven, SBT, or Gradle: https://github.com/apache/sedona/blob/master/examples/spark-sql/pom.xml#L61