I have been using the Stanford CoreNLP wrapper for Apache Spark to do NER analysis and found that it works well. However, I want to extend the simple example so that I can map the analysis back to the id of the original dataframe row. See below, where I have added two more rows to the simple example.
val input = Seq(
  (1, "Apple is located in California. It is a great company."),
  (2, "Google is located in California. It is a great company."),
  (3, "Netflix is located in California. It is a great company.")
).toDF("id", "text")
input.show()
input: org.apache.spark.sql.DataFrame = [id: int, text: string]
+---+--------------------+
| id|                text|
+---+--------------------+
|  1|     Apple is loc...|
|  2|     Google is lo...|
|  3|     Netflix is l...|
+---+--------------------+
I can then run this dataframe through the Spark CoreNLP wrapper to do both sentiment and NER analysis.
val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))
However, in the output below I have lost the connection back to the original dataframe row ids.
+--------------------+--------------------+--------------------+---------+
|                 sen|               words|             nerTags|sentiment|
+--------------------+--------------------+--------------------+---------+
|Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--------------------+--------------------+--------------------+---------+
Ideally, I want something like the following:
+--+---------------------+--------------------+--------------------+---------+
|id|                  sen|               words|             nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
| 1| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
| 2| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
| 3| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--+---------------------+--------------------+--------------------+---------+
I have tried to create a UDF but am unable to make it work.
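One direction that might avoid a UDF altogether is to carry the id column through each select, since explode can appear alongside ordinary columns in the same select. The following is only a sketch under assumptions: it presumes the wrapper's functions are imported from com.databricks.spark.corenlp.functions._, that the CoreNLP models are on the classpath, and that a local SparkSession (with a made-up app name) is acceptable. I have not confirmed it against every version of the wrapper.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode
// Assumption: this is the package that provides cleanxml, ssplit, tokenize, ner, sentiment.
import com.databricks.spark.corenlp.functions._

// Assumption: a local session is fine for this illustration.
val spark = SparkSession.builder()
  .appName("corenlp-keep-id")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val input = Seq(
  (1, "Apple is located in California. It is a great company."),
  (2, "Google is located in California. It is a great company."),
  (3, "Netflix is located in California. It is a great company.")
).toDF("id", "text")

// Keep 'id in every select so that each exploded sentence row
// still carries the id of the row it came from.
val output = input
  .select('id, cleanxml('text).as('doc))
  .select('id, explode(ssplit('doc)).as('sen))
  .select('id, 'sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

output.show()
```

If this works as hoped, every sentence row should repeat the id of its source row, which is the shape of the table I am after.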