pverspeelt closed this issue 6 years ago
The purpose of txt_nextgram is precisely to return a vector of the same length as the input vector, which as a consequence gives NA values at the end of the vector. That way you can easily do something like the following, where an extra column is added to the data.frame by sentence, and you can do further data processing with that extra column.
library(udpipe)
library(data.table)
ud_model <- udpipe_download_model("english")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model,
"It's exactly the purpose of txt_nextgram to return a vector of the same length as the input vector, which as a consequence gives NA values at the end of the vector.
So that you can easily do something like the following, where an extra column is added on the data.frame by sentence.
And you can do further data processing with the extra column.")
x <- as.data.table(x)
x <- x[, trigram := txt_nextgram(lemma, n = 3), by = list(doc_id, paragraph_id, sentence_id)]
x[, c("doc_id", "paragraph_id", "sentence_id", "lemma", "trigram")]
    doc_id paragraph_id sentence_id lemma        trigram
 1: doc1   1            1           it           it be exactly
 2: doc1   1            1           be           be exactly the
 3: doc1   1            1           exactly      exactly the purpose
 4: doc1   1            1           the          the purpose of
 5: doc1   1            1           purpose      purpose of txt_nextgram
 6: doc1   1            1           of           of txt_nextgram to
 7: doc1   1            1           txt_nextgram txt_nextgram to return
 8: doc1   1            1           to           to return a
 9: doc1   1            1           return       return a vector
10: doc1   1            1           a            a vector of
11: doc1   1            1           vector       vector of the
12: doc1   1            1           of           of the same
13: doc1   1            1           the          the same length
14: doc1   1            1           same         same length as
15: doc1   1            1           length       length as the
16: doc1   1            1           as           as the input
17: doc1   1            1           the          the input vector
18: doc1   1            1           input        input vector .
19: doc1   1            1           vector       NA
20: doc1   1            1           .            NA
21: doc1   1            2           so           so that you
22: doc1   1            2           that         that you can
23: doc1   1            2           you          you can easily
24: doc1   1            2           can          can easily do
25: doc1   1            2           easily       easily do something
26: doc1   1            2           do           do something like
27: doc1   1            2           something    something like the
28: doc1   1            2           like         like the follow
29: doc1   1            2           the          the follow .
30: doc1   1            2           follow       NA
31: doc1   1            2           .            NA
The example in the doc basically shows that behaviour.
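As a rough sketch of the kind of further processing meant above, the trigram column can be aggregated directly while skipping the NA rows at the end of each sentence. The small hand-built table below is only a stand-in for the annotated output. Its values are made up for illustration.

```r
library(udpipe)
library(data.table)

# Hypothetical stand-in for the annotated output: two short sentences
toy <- data.table(
  doc_id      = "doc1",
  sentence_id = c(1, 1, 1, 1, 2, 2, 2, 2),
  lemma       = c("the", "cat", "sleep", ".", "the", "cat", "sleep", ".")
)

# txt_nextgram returns one value per input row, so := works within each group
toy[, trigram := txt_nextgram(lemma, n = 3), by = list(doc_id, sentence_id)]

# Count trigram frequencies, ignoring the trailing NA rows of each sentence
freq <- toy[!is.na(trigram), .N, by = trigram][order(-N)]
print(freq)
```

Because the column keeps its alignment with the tokens, the filter on NA is applied only at the point where the n-grams are summarised.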
Fair point. I tend to use the n-grams separately from all the other info in the data.table / data.frame. I will close the issue.
When you use the txt_nextgram function, the result contains n-1 NA values, where n is the size of the n-gram you choose.
Since the NAs are not meaningful when creating n-grams like this, removing them might be the best option. Just adding
out <- out[!is.na(out)]
to the function would do the trick.
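If only the n-grams themselves are needed, the same filter can also be applied on the caller's side without changing the function. A minimal sketch, where `ngrams_only` is a hypothetical helper name and not part of udpipe:

```r
library(udpipe)

# Hypothetical wrapper: n-grams with the trailing NA values dropped
ngrams_only <- function(x, n = 2) {
  out <- txt_nextgram(x, n = n)
  out[!is.na(out)]
}

tokens <- c("a", "b", "c", "d", "e")
padded <- txt_nextgram(tokens, n = 3)  # same length as tokens, last n - 1 entries are NA
dense  <- ngrams_only(tokens, n = 3)   # only the complete trigrams
print(padded)
print(dense)
```

The trade-off is that the filtered result no longer lines up with the input tokens, so it cannot be assigned back as a column.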