bnosac / pattern.nlp

R package to perform sentiment analysis and Parts of Speech tagging for Dutch/French/English/German/Spanish/Italian

How to apply to large dataframe #13

Closed frederatic closed 5 years ago

frederatic commented 5 years ago

I want to apply the pattern_sentiment function to a dataframe of tweets, specifically to 1 column containing the text. I used a for loop and binds first, but this takes a long time on millions of rows, so then I tried the apply function, but for some reason it does not work, returning 0.0 for every value. I am just a beginner at R, so can anyone help me with a solution?

Here are the 2 methods I tried:

1. Uses loop and binds

library(pattern.nlp)

# Apply function to every row and output the results and bind

for (x in 1:4000000) {sentiments <- rbind(sentiments, (pattern_sentiment(tweets$text_clean[x], language="dutch")))}

# Bind dataframe of polarity, subjectivity and id with the original tweets dataframe

tweets <- cbind(tweets, sentiments)

2. Use apply

library(pattern.nlp)

sentiment_function <- function(x) { pattern_sentiment(x, language="dutch") }

sentiments <- apply(tweets['text_clean'], 1, sentiment_function)

frederatic commented 5 years ago

Fixed using lapply

jwijffels commented 5 years ago

Use lapply to score each text, then data.table::rbindlist to combine the results into one table.
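A minimal sketch of that lapply plus rbindlist pattern, under a few assumptions: the three-row `tweets` data frame is made up for illustration, and `score_one` is a hypothetical stand-in that returns a one-row data frame so the example runs without the package's Python backend. With pattern.nlp installed, the call shown in the comment would take its place.

```r
library(data.table)

# Illustrative data only; the real data frame in this thread has millions of rows.
tweets <- data.frame(
  text_clean = c("wat een mooie dag", "dit is echt slecht", "geen mening"),
  stringsAsFactors = FALSE
)

# Hypothetical placeholder scorer (an assumption, not the package's function).
# With pattern.nlp installed, the body would instead be:
#   pattern_sentiment(x, language = "dutch")
score_one <- function(x) {
  data.frame(polarity = 0, subjectivity = 0, stringsAsFactors = FALSE)
}

# Score each text separately: lapply returns a list of one-row data frames.
per_row <- lapply(tweets$text_clean, score_one)

# Combine the list in a single step instead of calling rbind inside a loop,
# which copies the growing result on every iteration.
sentiments <- as.data.frame(data.table::rbindlist(per_row))

# Attach the scores to the original rows (same order as the input).
tweets <- cbind(tweets, sentiments)
```

Passing the column as a vector (`tweets$text_clean`) rather than a one-column data frame means each call receives a single string.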

Missfortunate commented 4 years ago

Hey, I'm trying to apply this solution to my own data set, but it doesn't seem to work.

I have a list that looks like this: list("deze man in haalt alvast de eu vlag weg en zet de russische in de plaats <U+30FC> ", "solidariteit cubaitalië on point nu nog europa onderling be ", "bezoekersrichtlijn ggz geen algehele bezoekersstop ggz samen met hebben we een aangepaste richtlijn op", "bij het om gebuiken ze het blijkbaar om op hun luie gat te liggen of achterstallig onderhoud van het", "app die in vele talen kan vertellen of je echt tekenen van hebt test ")

I tried to use lapply with the function created by frederatic: sentiments <- lapply(try, sentiment_function)

However, this gives me 5 dataframes with the correct columns, but they contain gibberish. For example, the polarity column = ÿþ0ÿþ

When I use the function on just 1 string: pattern_sentiment(try[[5]], language = "dutch") it does seem to work.

I'm not very familiar with lapply, so not quite sure what I'm doing wrong.

frederatic commented 4 years ago

It's been a while, but I think this was my code. Did you assign the function to a variable like below?

sentiment_function <- function(x) {
  pattern_sentiment(x, language="dutch")
}

sentiments <- lapply(LISTNAME, sentiment_function)

You should get a list called sentiments that has the scores for every sentence. I used a dataframe instead, so I got a dataframe as output. Then just bind the original tweets dataframe with the sentiment dataframe. Hope it works.

Missfortunate commented 4 years ago

Hey,

That does indeed work, however for some reason it wraps the numbers in unexpected symbols like this:

[image: screenshot of the output, not preserved]

I guess I can extract the contents with something like stringr, but it just seems odd to me that it behaves this way.
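As a rough sketch of that extraction idea, base R's `gsub` can do the same job without stringr. The input values below are made up, modelled on the `ÿþ0ÿþ` example earlier in the thread, and the approach assumes the wrapped values are otherwise plain decimal numbers. It is a workaround for the symptom only and does not address whatever causes the stray characters.

```r
# Made-up values wrapped in the stray "ÿþ" characters (written as \u00ff\u00fe).
raw_polarity <- c(
  "\u00ff\u00fe0\u00ff\u00fe",
  "\u00ff\u00fe-0.25\u00ff\u00fe",
  "\u00ff\u00fe0.6\u00ff\u00fe"
)

# Drop every character that cannot be part of a number, then convert.
# as.character() guards against the column having been read as a factor.
polarity <- as.numeric(gsub("[^0-9eE.+-]", "", as.character(raw_polarity)))
```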

frederatic commented 4 years ago

A quick Google search suggests it might be a bug. Maybe these links help:

How to remove display of strange characters in R-Markdown chunk output?
Github Issue