databricks / spark-corenlp

Stanford CoreNLP wrapper for Apache Spark
GNU General Public License v3.0
422 stars 120 forks

How to index Spark CoreNLP analysis? #29

Open eoinlane opened 6 years ago

eoinlane commented 6 years ago

I have been using the Stanford CoreNLP wrapper for Apache Spark to do NER (named entity recognition) analysis and found it works well. However, I want to extend the simple example so that I can map the analysis back to the id of the original DataFrame row. See below: I have added two more rows to the simple example.

val input = Seq(
  (1, "Apple is located in California. It is a great company."),
  (2, "Google is located in California. It is a great company."),
  (3, "Netflix is located in California. It is a great company.")
).toDF("id", "text")

input.show()

input: org.apache.spark.sql.DataFrame = [id: int, text: string]
+---+--------------------+
| id|                text|
+---+--------------------+
|  1|     Apple is loc...|
|  2|     Google is lo...|
|  3|     Netflix is l...|
+---+--------------------+

I can then run this DataFrame through the Spark CoreNLP wrapper to do both sentiment and NER analysis.

val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

However, in the output below I have lost the connection back to the original DataFrame row ids.

+--------------------+--------------------+--------------------+---------+
|                 sen|               words|             nerTags|sentiment|
+--------------------+--------------------+--------------------+---------+
|Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--------------------+--------------------+--------------------+---------+

Ideally, I want something like the following:

+--+---------------------+--------------------+--------------------+---------+
|id|                  sen|               words|             nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
| 1| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
| 2| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
| 3| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--+---------------------+--------------------+--------------------+---------+

I have tried to create a UDF but am unable to make it work.
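
For reference, one possible way to get the desired shape without a custom UDF is to carry the `id` column through every `select`, since each `select` keeps only the columns it names. The following is a minimal sketch, not a confirmed answer from the maintainers. It assumes the spark-corenlp package and the CoreNLP models are on the classpath, creates a local `SparkSession` (in `spark-shell` the existing `spark` value could be used instead), and uses the `$"col"` column syntax in place of the symbol syntax from the example above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode
import com.databricks.spark.corenlp.functions._

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .appName("corenlp-keep-id")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val input = Seq(
  (1, "Apple is located in California. It is a great company."),
  (2, "Google is located in California. It is a great company."),
  (3, "Netflix is located in California. It is a great company.")
).toDF("id", "text")

val output = input
  // Name "id" in each select so it is not dropped along the way.
  .select($"id", cleanxml($"text").as("doc"))
  // explode emits one row per sentence; "id" is repeated on each of them.
  .select($"id", explode(ssplit($"doc")).as("sen"))
  .select(
    $"id",
    $"sen",
    tokenize($"sen").as("words"),
    ner($"sen").as("nerTags"),
    sentiment($"sen").as("sentiment"))

output.show()
```

Because `explode` is the only generator in its `select`, Spark allows ordinary columns such as `id` next to it, and every sentence row inherits the id of the document it came from.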