databricks / spark-corenlp

Stanford CoreNLP wrapper for Apache Spark
GNU General Public License v3.0

Failed on empty strings - sentiment analysis #45

Status: Open. xinlutu2 opened this issue 4 years ago

xinlutu2 commented 4 years ago
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

val input = Seq(
  (1, "Stanford University is located in California. It is a great university"),
  (2, "")
).toDF("id", "text")

input.withColumn("sentiment", sentiment($"text")).show()

Error: org.apache.spark.SparkException: Failed to execute user defined function(anonfun$sentiment$1: (string) => int)

Any plans to catch this issue properly?

timrourke commented 4 years ago

@xinlutu2 Not a contributor or maintainer of spark-corenlp, but I suspect you're on your own to deal with empty strings. This library is a very thin wrapper around CoreNLP, and wouldn't be the place to handle empty strings.

https://github.com/databricks/spark-corenlp/blob/master/src/main/scala/com/databricks/spark/corenlp/functions.scala#L157

One option might be to filter your DataFrame so that rows with an empty text column are removed before sentiment is called.
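
A minimal sketch of that workaround, reusing the DataFrame from the original report. It assumes a Spark 2.x or later session with spark-corenlp and the CoreNLP models on the classpath. The local SparkSession setup, the app name, and the names nonEmpty and hasText are illustrative and not part of the library. Treating null and whitespace-only text as "empty" is also an assumption, since the report only covers the empty-string case.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

// Illustrative local session; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder()
  .appName("sentiment-empty-strings") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val input = Seq(
  (1, "Stanford University is located in California. It is a great university"),
  (2, "")
).toDF("id", "text")

// Assumption: rows with null or whitespace-only text should be skipped too.
val hasText = $"text".isNotNull && length(trim($"text")) > 0

// Option 1: drop the empty rows, then score what is left.
val nonEmpty = input.filter(hasText)
nonEmpty.withColumn("sentiment", sentiment($"text")).show()

// Option 2: keep every row and call sentiment only inside a conditional
// branch; rows that fail the check get a null sentiment.
input.withColumn("sentiment", when(hasText, sentiment($"text"))).show()
```

Option 1 changes the row count, so use it when empty rows carry no information. Option 2 preserves the row count, which is easier to join back to other data. Spark's documentation recommends guarding a UDF with a conditional expression such as CASE WHEN rather than relying on evaluation order inside a single filter, which is why the check in Option 2 wraps the sentiment call.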