bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

dtm_remove_terms: Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) : 'x' must be an array of at least two dimensions #45

Closed Hadsga closed 5 years ago

Hadsga commented 5 years ago

Hi,

I have created a dtm and removed the sparse terms.

library(tm)
library(dplyr)

samp = datsub %>%
  select(Reviews) %>%
  sample_n(2)

corpus = Corpus(VectorSource(samp$Reviews))

dtm = DocumentTermMatrix(corpus)

dtm = removeSparseTerms(dtm, 0.98)

However, some terms are still useless, so I tried:

library(udpipe)

useless_terms = c("buy")

dtm_remove_terms(dtm = dtm, terms = useless_terms)

But I get this error:

`Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) : 'x' must be an array of at least two dimensions`

dput(samp)
structure(list(Reviews = c("problem electric connector appropriate fot phone", 
"great phone even good pricebe sure buy sim small sim model large type"
)), row.names = c(352L, 4907L), class = "data.frame")

jwijffels commented 5 years ago

Can you provide a reproducible example? It looks like your dtm has no columns left after you removed the useless_terms. It could also be that you need to call document_term_matrix on your dtm object before you call dtm_remove_terms; that makes sure your object is a sparse matrix and not a DocumentTermMatrix from the tm package.

Hadsga commented 5 years ago

No, there are still columns after applying this function (see my edit). The manual says: "dtm - an object returned by document_term_matrix or an object of class DocumentTermMatrix". So it should work with a dtm from the tm package. However, your package doesn't have a function to remove sparse terms, so I have to use the tm package. Perhaps you should change the text in the manual.

jwijffels commented 5 years ago

Can you please provide a reproducible example?

jwijffels commented 5 years ago

You should do this:

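library(tm)
library(udpipe)
# Data taken from the dput() output posted in the issue
samp <- structure(list(Reviews = c("problem electric connector appropriate fot phone", 
                                   "great phone even good pricebe sure buy sim small sim model large type")), 
                  row.names = c(352L, 4907L), class = "data.frame")
# Build the tm corpus and its DocumentTermMatrix
corpus = Corpus(VectorSource(samp$Reviews))
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.98)
# Convert the tm object to a sparse matrix of the Matrix package
dtm = document_term_matrix(dtm)
# The dtm_* functions work on the converted object
dtm_remove_terms(dtm, c("phone", "sim"))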

The dtm_... functions only work with sparse matrices of the Matrix package. If you want a replacement for removeSparseTerms, it's called dtm_remove_lowfreq.
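
For anyone who wants to stay within udpipe and skip tm entirely, a minimal sketch might look like the following. It assumes the reviews are already cleaned and can be split on single spaces; the `doc_ids` vector, the `split = " "` choice and `minfreq = 2` are illustrative assumptions, not values from the thread. Note that dtm_remove_lowfreq filters on a minimum term frequency rather than on a sparsity ratio, so it is not a drop-in equivalent of removeSparseTerms(dtm, 0.98).

```r
library(udpipe)

# The two reviews from the issue; the ids reuse the original row names
reviews <- c("problem electric connector appropriate fot phone",
             "great phone even good pricebe sure buy sim small sim model large type")
doc_ids <- c("352", "4907")

# Count terms per document (assumption: tokens are separated by one space)
freq <- document_term_frequencies(reviews, document = doc_ids, split = " ")

# Build a sparse matrix of the Matrix package, which the dtm_* functions accept
dtm <- document_term_matrix(freq)

# Drop hand-picked terms
dtm_clean <- dtm_remove_terms(dtm, terms = c("buy"))

# Drop terms that occur fewer than 2 times overall (illustrative threshold)
dtm_small <- dtm_remove_lowfreq(dtm, minfreq = 2)

colnames(dtm_clean)
colnames(dtm_small)
```

Because the matrix is created by document_term_matrix from the start, no conversion step is needed before calling dtm_remove_terms.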