bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

txt_nextgram leaves NA values #14

Closed pverspeelt closed 6 years ago

pverspeelt commented 6 years ago

When you use the txt_nextgram function, the result ends with n-1 NA values, where n is the n-gram size you choose.

x <- sprintf("%s%s", LETTERS, 1:26)
txt_nextgram(x, n = 2)
[1] "A1 B2"   "B2 C3"   "C3 D4"   "D4 E5"   "E5 F6"   "F6 G7"   "G7 H8"   "H8 I9"   "I9 J10"  "J10 K11" "K11 L12" "L12 M13" "M13 N14"
[14] "N14 O15" "O15 P16" "P16 Q17" "Q17 R18" "R18 S19" "S19 T20" "T20 U21" "U21 V22" "V22 W23" "W23 X24" "X24 Y25" "Y25 Z26" NA    

Since the NAs are not meaningful when creating n-grams like this, removing them might be the best option. Just adding out <- out[!is.na(out)] to the function would do the trick.
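For reference, the same filter can also be applied on the caller's side without touching the package. The sketch below is only an illustration in base R: nextgram_sketch is a hypothetical stand-in that mimics the output shown above (same length as the input, n-1 trailing NA values). It is not the package's actual implementation.

```r
# Hypothetical base-R stand-in for the behaviour shown above:
# returns a vector as long as the input, with n-1 trailing NAs.
nextgram_sketch <- function(x, n = 2, sep = " ") {
  len <- length(x)
  out <- rep(NA_character_, len)
  if (len >= n) {
    starts <- seq_len(len - n + 1)
    out[starts] <- vapply(
      starts,
      function(i) paste(x[i:(i + n - 1)], collapse = sep),
      character(1)
    )
  }
  out
}

x <- sprintf("%s%s", LETTERS, 1:26)
grams <- nextgram_sketch(x, n = 2)

# Drop the trailing NA values on the caller's side.
grams_clean <- grams[!is.na(grams)]
length(grams)        # 26, same as the input
length(grams_clean)  # 25, the actual bigrams
```

With n = 3 the same filter would drop two values instead of one.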

jwijffels commented 6 years ago

It's exactly the purpose of txt_nextgram to return a vector of the same length as the input vector, which as a consequence gives NA values at the end of the vector. This lets you easily do something like the following, where an extra column is added to the data.frame by sentence, so you can do further data processing with that column.

library(udpipe)
library(data.table)
ud_model <- udpipe_download_model("english")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model,
                     "It's exactly the purpose of txt_nextgram to return a vector of the same length as the input vector, which as a consequence gives NA values at the end of the vector.
                      So that you can easily do something like the following, where an extra column is added on the data.frame by sentence. 
                      And you can do further data processing with the extra column.")
x <- as.data.table(x)
x <- x[, trigram := txt_nextgram(lemma, n = 3), by = list(doc_id, paragraph_id, sentence_id)]
x[, c("doc_id", "paragraph_id", "sentence_id", "lemma", "trigram")]
    doc_id paragraph_id sentence_id        lemma                 trigram
 1:   doc1            1           1           it           it be exactly
 2:   doc1            1           1           be          be exactly the
 3:   doc1            1           1      exactly     exactly the purpose
 4:   doc1            1           1          the          the purpose of
 5:   doc1            1           1      purpose purpose of txt_nextgram
 6:   doc1            1           1           of      of txt_nextgram to
 7:   doc1            1           1 txt_nextgram  txt_nextgram to return
 8:   doc1            1           1           to             to return a
 9:   doc1            1           1       return         return a vector
10:   doc1            1           1            a             a vector of
11:   doc1            1           1       vector           vector of the
12:   doc1            1           1           of             of the same
13:   doc1            1           1          the         the same length
14:   doc1            1           1         same          same length as
15:   doc1            1           1       length           length as the
16:   doc1            1           1           as            as the input
17:   doc1            1           1          the        the input vector
18:   doc1            1           1        input          input vector .
19:   doc1            1           1       vector                      NA
20:   doc1            1           1            .                      NA
21:   doc1            1           2           so             so that you
22:   doc1            1           2         that            that you can
23:   doc1            1           2          you          you can easily
24:   doc1            1           2          can           can easily do
25:   doc1            1           2       easily     easily do something
26:   doc1            1           2           do       do something like
27:   doc1            1           2    something      something like the
28:   doc1            1           2         like         like the follow
29:   doc1            1           2          the            the follow .
30:   doc1            1           2       follow                      NA
31:   doc1            1           2            .                      NA

The example in the doc basically shows that behaviour.
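To make the length argument concrete without downloading a model, here is a minimal base-R sketch. It rests on assumptions: nextgram_sketch is a hypothetical helper that imitates the same-length output shown above, and the tokens data.frame is made-up toy data rather than real annotation output. Because the helper returns one value per input token, the per-sentence result can be assigned straight back as a column.

```r
# Hypothetical stand-in: one output element per input token,
# with NA where no full n-gram starts.
nextgram_sketch <- function(x, n = 2, sep = " ") {
  len <- length(x)
  out <- rep(NA_character_, len)
  if (len >= n) {
    starts <- seq_len(len - n + 1)
    out[starts] <- vapply(
      starts,
      function(i) paste(x[i:(i + n - 1)], collapse = sep),
      character(1)
    )
  }
  out
}

# Toy token table (illustrative data, not udpipe output).
tokens <- data.frame(
  sentence_id = c(1, 1, 1, 2, 2),
  lemma = c("it", "be", "exactly", "so", "that"),
  stringsAsFactors = FALSE
)

# ave() applies the function per sentence and needs a result of the
# same length as each group, which is what the NA padding provides.
tokens$bigram <- ave(
  tokens$lemma,
  tokens$sentence_id,
  FUN = function(v) nextgram_sketch(v, n = 2)
)
print(tokens)
```

If the helper dropped the NA values itself, each group would come back shorter than its input and the column assignment would no longer line up with the rows.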

pverspeelt commented 6 years ago

Fair point. I tend to use the n-grams separately from all the other info in the data.table / data.frame. I will close the issue.