databricks / spark-corenlp

Stanford CoreNLP wrapper for Apache Spark
GNU General Public License v3.0
422 stars · 120 forks

Constituency Parsing ? #21

Open shopuz opened 7 years ago

shopuz commented 7 years ago

I can see that there is a function defined for dependency parsing, depparse. However, I can't see constituency parsing (parse) in the list of functions. Is there any way I can get a constituency parse?

lucy3 commented 6 years ago

I'm wondering this too!

It would be an additional user-defined function in the functions file, or in whatever file you're working in (as long as you have all of the necessary import statements):

import scala.collection.JavaConverters._
import edu.stanford.nlp.simple.Sentence
import org.apache.spark.sql.functions.udf

def parse = udf { sentence: String =>
    // Tree is a collection of its subtrees, so this joins every subtree's string form.
    new Sentence(sentence).parse().asScala.map(_.toString).mkString(" ")
}

and you would use it as

val input = Seq(
    (1, "Stanford is located in California. There are sometimes mountain lions on campus.")
).toDF("id", "quote")

val output = input
    .select(col("quote"), explode(ssplit(col("quote"))).as("sent"))
    .select(col("quote"), col("sent"), parse(col("sent")).as("parse"))

output.show()

(Edited this comment to be more correct after I played around with it in spark-shell.)

phuongnm94 commented 3 years ago

Thanks @lucy3. I tried running the code below in spark-shell, and the output is a little better:

import edu.stanford.nlp.simple.Sentence

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

def parse = udf { sentence: String =>
  // pennString() pretty-prints the parse tree in Penn Treebank format;
  // stripping the newlines keeps the whole tree on a single line.
  new Sentence(sentence).parse().pennString().replace("\n", "")
}

and, as in @lucy3's example, it can be used as:

val input = Seq(
    (1, "Stanford is located in California. There are sometimes mountain lions on campus.")
).toDF("id", "quote")
val output = input
    .select(col("quote"), explode(ssplit(col("quote"))).as("sent"))
    .select(col("quote"), col("sent"), parse(col("sent")).as("parse"))
output.show()
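For what it's worth, the single-line output in this version comes entirely from the `.replace("\n", "")` applied to the Penn-format string. A minimal pure-Scala sketch of just that flattening step, using a hand-written tree string for illustration (not actual CoreNLP output), so it runs without CoreNLP or Spark on the classpath:

```scala
// Hand-written Penn-Treebank-style string, standing in for what
// pennString() might return for a short sentence.
val pennTree =
  """(ROOT
    |  (S
    |    (NP (NNP Stanford))
    |    (VP (VBZ is)
    |      (VP (VBN located)
    |        (PP (IN in) (NP (NNP California)))))
    |    (. .)))""".stripMargin

// Collapse the pretty-printed tree onto one line, as the parse UDF does.
val flat = pennTree.replace("\n", "")
println(flat)
```

This keeps the indentation spaces, so the bracket structure stays readable in a single DataFrame cell while fitting on one line of output.show().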