databricks / spark-corenlp

Stanford CoreNLP wrapper for Apache Spark
GNU General Public License v3.0

Example Program Issue #18

Open anuj-malhotra opened 7 years ago

anuj-malhotra commented 7 years ago

Hi, I am trying to run the example program below with Spark 1.6 and Java 1.8.0_60:

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
import sqlContext.implicits._

val input = Seq(
    (1, "Stanford University is located in California. It is a great university.")
).toDF("id", "text")

val output = input
    .select(cleanxml('text).as('doc))
    .select(explode(ssplit('doc)).as('sen))
    .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

It's throwing an exception when the output variable is assigned; the error is:

error: bad symbolic reference. A signature in functions.class refers to type UserDefinedFunction in package org.apache.spark.sql.expressions which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling functions.class.

:36: error: org.apache.spark.sql.expressions.UserDefinedFunction does not take parameters
val output = input.select(cleanxml('text).as('doc)).select(explode(ssplit('doc)).as('sen)).select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

Can you please advise where I am making a mistake?
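The error message itself points at a version mismatch between the shell and the compiled package. A quick sanity check in spark-shell (a sketch; it only prints what the shell is actually running):

println(sc.version)                          // Spark version of the running shell, e.g. 1.6.x
println(scala.util.Properties.versionString) // Scala version; must match the package suffix (s_2.10 vs s_2.11)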
anuj-malhotra commented 7 years ago

@mengxr - Could you please advise what I could be doing wrong in the above code?

Initially I am trying this in spark-shell. I started spark-shell using the command below:

JAVA_HOME=/usr/java/jdk1.8.0_60/ spark-shell --packages databricks:spark-corenlp:0.2.0-s_2.10,edu.stanford.nlp:stanford-corenlp:3.6.0

I also tried this piece of code:

CoreNLP coreNLP = new CoreNLP()
    .setInputCol("text")
    .setAnnotators(new String[]{"tokenize", "ssplit", "lemma"})
    .setFlattenNestedFields(new String[]{"sentence_token_word"})
    .setOutputCol("parsed")
val outputDF = coreNLP.transform(input)

This doesn't work either, as Spark isn't able to locate CoreNLP (the error is [error: not found: type CoreNLP]). Could you advise which extra library I need to add, or what correction the code needs?
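If the transformer-style CoreNLP class does exist in the artifact version being loaded, the missing piece is likely just an import. A Scala sketch, where the package path is an assumption to verify against the artifact:

import com.databricks.spark.corenlp.CoreNLP  // assumed path; confirm it exists in your spark-corenlp version

val coreNLP = new CoreNLP()
    .setInputCol("text")
    .setAnnotators(Array("tokenize", "ssplit", "lemma"))
    .setFlattenNestedFields(Array("sentence_token_word"))
    .setOutputCol("parsed")
val outputDF = coreNLP.transform(input)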

jweinsteincbt commented 7 years ago

@anuj-malhotra You need to pass the CoreNLP models jar file to Spark:

spark-shell --jars lib/stanford-corenlp/stanford-corenlp-3.6.0-models.jar \
    --packages databricks:spark-corenlp:0.2.0-s_2.11,edu.stanford.nlp:stanford-corenlp:3.6.0

Worked with Spark 2.0.0 and Scala 2.11

You would probably need an earlier version than databricks:spark-corenlp:0.2.0-s_2.11 to support Spark 1.6. (PS: You can't run Java code in spark-shell, but you can run it with spark-submit once compiled.)
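For the spark-submit route, a minimal sketch (the main class and application jar names are placeholders):

spark-submit --class com.example.CoreNLPExample \
    --jars lib/stanford-corenlp/stanford-corenlp-3.6.0-models.jar \
    --packages databricks:spark-corenlp:0.2.0-s_2.11,edu.stanford.nlp:stanford-corenlp:3.6.0 \
    my-corenlp-example.jar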

lucy3 commented 6 years ago

I had this error, too. I ended up just copying each udf I wanted to use into my code (with the appropriate import statements).

import java.util.Properties

import scala.collection.JavaConverters._

import edu.stanford.nlp.ling.{CoreAnnotations, CoreLabel}
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{CleanXmlAnnotator, StanfordCoreNLP}
import edu.stanford.nlp.pipeline.CoreNLPProtos.Sentiment
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import edu.stanford.nlp.simple.{Document, Sentence}
import edu.stanford.nlp.util.Quadruple
import edu.stanford.nlp.trees.Tree

import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import sqlContext.implicits._

// Split a document into sentences using CoreNLP's simple Document API
def ssplit = udf { document: String =>
    new Document(document).sentences().asScala.map(_.text())
}

val input = Seq(
    (1, "Pies are delicious. Pi day is March 14.")
).toDF("id", "text")

val output = input.select(col("text"), explode(ssplit(col("text"))).as("sent"))

output.show()

using the spark-shell command:

spark-shell --master yarn --packages databricks:spark-corenlp:0.2.0-s_2.11 --jars lib/stanford-corenlp-3.9.1-models.jar 

where "lib" can be replaced with where ever your model jar resides.