Closed hahahannes closed 3 years ago
Hi,
I'm also thankful for the R scripts. I'm encountering the same issue. Did anyone find a solution? It's happening on a Mac and a PC, both running very recent versions of everything (including R).
Thank you for any insights.
Annick
Hey @hahahannes and @afntanguay - my apologies for the delay, I don't think I got a GitHub notification for the first message.
This is likely an error with TreeTagger, which is a giant pain to get working correctly. Since the publication of this paper, I've found the package udpipe is much easier to use, with similar results.
I don't have an example built into this chain of analysis just yet, but here's an example:
library(udpipe)
vector_of_text <- c("this is a sentence", "he was singing", "she helped beautifully")
tagged <- udpipe(vector_of_text, object = "english")
head(tagged)
Made up some random sentences here - there are pros and cons of the output but it saves as a nice dataframe, does lemmatization somewhat well, has many languages, and is thankfully pretty darned easy to run. Let me know what questions you have.
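In case it helps, udpipe() returns a plain data frame; the columns this pipeline cares about are token, lemma, and upos. A minimal sketch of pulling just those out (assuming the udpipe package is installed; the first call downloads the English model):

```r
library(udpipe)

# tag a short sentence and keep only the columns used downstream
tagged <- udpipe("she helped beautifully", object = "english")
tagged[, c("token", "lemma", "upos")]
```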
Thank you @doomlab . That works and it's indeed very easy to run! I still need to take a closer look at the data, but here's the tweak I made so I could use the next parts mostly as is (though I'm not sure multi-word features will work yet). Also, I don't know R, so I wouldn't swear any of this is correct, but maybe it will make things easier for people in the same boat.
library(udpipe)
library(stringi)  # for stri_replace_all_regex()
library(dplyr)    # for the %>% pipeline below

# Read the raw data
X <- read.csv("C:\\path\\1rawdata_GS_PerFeature.csv", header = TRUE, sep = ",",
              stringsAsFactors = FALSE, encoding = "UTF-8")
names(X) <- c("doc_id", "text")
## Lower case to normalize
X$text <- tolower(X$text)
# Read the spelling dictionary
spelling.dict <- read.csv("C:\\path\\4spelling.dict.checked_GS_PerFeature.csv", stringsAsFactors = FALSE)
# This is where the correction for spelling errors happens
X$corrected <- stri_replace_all_regex(str = X$text,
                                      pattern = spelling.dict$spelling.pattern,
                                      replacement = spelling.dict$spelling.sugg,
                                      vectorize_all = FALSE)
# Rename columns for udpipe
X <- X %>%
  select(doc_id, corrected) %>%
  rename(text = corrected)
# Do the tagging with udpipe
tagged <- udpipe(X, object = "english")
# Write the lemmatized data
write.csv(x = tagged, file = "C:\\path\\6lemmatized_Features.csv",
          fileEncoding = "UTF-8", row.names = FALSE)
# The clean-up step is not necessary with udpipe, because the lemma column keeps the original feature when it's unknown. We did go back to the spelling dictionary to reduce the number of unknowns, which udpipe marks with an X.
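To see which features remained unknown, one can filter the tagged output on upos == "X". A sketch using a mock data frame standing in for udpipe's output (the column names match udpipe's, but the rows here are invented for illustration):

```r
# mock of the data frame udpipe() returns; only the relevant columns
tagged <- data.frame(doc_id = c("d1", "d1", "d2"),
                     token  = c("dog", "blorf", "sing"),
                     upos   = c("NOUN", "X", "VERB"),
                     stringsAsFactors = FALSE)

# tokens udpipe could not recognize -- candidates for the spelling dictionary
unknowns <- tagged[tagged$upos == "X", ]
unknowns$token
```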
(Also, to let udpipe do the tokenization, I skipped or moved the steps below:)
# Parse features (uses tidytext::unnest_tokens and stringr)
tokens <- unnest_tokens(tbl = X, output = token, input = feature_response,
                        token = stringr::str_split, pattern = " |\\, |\\.|\\,|\\;")
tokens$token <- trimws(tokens$token, which = "both", whitespace = "[ \t\r\n]")
# Remove empty features
tokens <- tokens[!tokens$token == "", ]
tokens$corrected <- stri_replace_all_regex(str = tokens$token,
                                           pattern = spelling.dict$spelling.pattern,
                                           replacement = spelling.dict$spelling.sugg,
                                           vectorize_all = FALSE)
# Rename columns
tokens <- tokens %>%
  rename(feature = corrected) %>%
  select(cue, feature)
@afntanguay - that looks great! I'm actually working on reprocessing my whole set of data (rather than the small subset here), so I'll incorporate this update. I'll close the issue but add a link to the readme so people can access it.
Hi,
thank you for this interesting approach and the provided R scripts. I am struggling a little bit to get them running. I get the error that there is no slot
TT.res
in the kRp.text
class in the lemmatization.R script at this line: https://github.com/doomlab/FLT-Primer/blob/master/R/lemmatization.R#L29. I am not proficient with R, so my guess is that it is caused by some version problem. Can you maybe specify the versions of the dependencies you used?
Thank you very much!
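For anyone comparing setups to pin down the version mismatch, the installed R and package versions can be printed with base R (koRpus is the package that defines the kRp.text class):

```r
# report the versions involved when filing an issue
sessionInfo()
```

packageVersion("koRpus") will additionally report the specific koRpus version, if it is installed.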