bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Why not use data.table? #13

Closed: emillykkejensen closed this issue 6 years ago

emillykkejensen commented 6 years ago

Not so much an issue as a question. Since you already use data.table in your package, why not go all the way and use it more effectively in your functions?

I have tried doing a txt_freq using data.table, and even on a dataset as small as mine the speed difference is notable:

> system.time({
+   test1 <- txt_freq_DT(udpipe_test_data)
+ })
    user  system elapsed 
    0.00    0.00    0.03 
> system.time({
+   test2 <- txt_freq(udpipe_test_data$token)
+ })
    user  system elapsed 
    0.15    0.00    0.15 

The custom function I have used looks like this:

txt_freq_DT <- function(x, exclude = c(NA, NaN), order = TRUE){

  # x: a data.frame/data.table with a 'token' column (requires data.table)
  x <- as.data.table(x)
  x <- x[, list(freq = .N), by = list(key = token)]  # count each token
  x <- x[!(key %in% exclude)]                        # drop excluded keys
  x[, freq_pct := 100 * freq / sum(freq)]            # percentage of total
  if(order) x <- x[order(-freq)]                     # most frequent first

  return(x)

}
jwijffels commented 6 years ago

Yes, you are probably right. txt_freq uses table instead of data.table.

Why was that? Originally my intention was to not make the package depend on any package other than Rcpp, and I believe this function was created at that point with that idea in mind. Later on I added extra utility functions to make the life of the analyst easier (e.g. to find keywords, do replacement of keywords, phrases, collocations, RAKE, ngrams based on the output of udpipe_annotate, ...) where I couldn't get around using data.table, as it provides quite a lot of data handling utilities, has the same license, and has no further package dependencies. That's how more data.table functionality slipped in.

In general, the idea is to have no R package dependencies beyond packages distributed with base R, Rcpp and data.table. That's, for example, the reason why the textrank functionality is put into another R package (textrank on CRAN, which depends on igraph) and why I don't add plotting functions inside this package: that would pull in more dependencies which I don't want, and the world is filled with people with different opinions on which graphing library is best.

Regarding the remark: completely right, if it's faster we can certainly rewrite it. The idea with these txt_ functions is to always take a character vector as the first input, not a data.table/data.frame. Are there other functions where you think improvements can be obtained? On what volume of data did you run this benchmark?
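For context, the table-based approach works roughly like the following sketch. This is a hypothetical approximation of what txt_freq does internally, not the actual package source; the function name txt_freq_table is made up for illustration:

```r
# Rough sketch of a table()-based frequency count (hypothetical,
# not the actual udpipe source). table() keeps all factor levels,
# including levels with zero occurrences.
txt_freq_table <- function(x, exclude = c(NA, NaN), order = TRUE) {
  tab <- table(x, exclude = exclude)
  result <- data.frame(key = names(tab),
                       freq = as.integer(tab),
                       stringsAsFactors = FALSE)
  result$freq_pct <- 100 * result$freq / sum(result$freq)
  if (order) result <- result[order(result$freq, decreasing = TRUE), ]
  result
}
```

table() does the counting in C but builds the full cross-tabulation machinery, which is why a grouped data.table count can be faster on plain character vectors.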

emillykkejensen commented 6 years ago

I really like your thinking: too often have I installed a package only to find it downloading a ton of dependency packages, so kudos for that.

My dataset consists of 18,300 observations (words), but I'm just looking into udpipe for the first time now, and since it really is a great package I'm going to try it on a much bigger dataset; that's where the data.table rewrite comes in handy.

I can post here in the comments if I find other functions you could consider rewriting to data.table, if you like?

jwijffels commented 6 years ago

While looking at it again, I remembered why I used table: I wanted to keep the order of the factor levels in case x was a factor, and also, if x is a factor with levels not present in the data, to still return those levels. Example:

x <- sample(LETTERS, 1000, replace = TRUE)
x <- factor(x, levels = c("otherfactor", LETTERS, "UNKN"))
txt_freq(x, order = FALSE)
key freq freq_pct
1  otherfactor    0      0.0
2            A   50      5.0
3            B   40      4.0
4            C   46      4.6
5            D   49      4.9
6            E   43      4.3
7            F   34      3.4
8            G   29      2.9
9            H   31      3.1
10           I   41      4.1
11           J   35      3.5
12           K   37      3.7
13           L   45      4.5
14           M   43      4.3
15           N   40      4.0
16           O   33      3.3
17           P   34      3.4
18           Q   33      3.3
19           R   24      2.4
20           S   38      3.8
21           T   45      4.5
22           U   43      4.3
23           V   37      3.7
24           W   44      4.4
25           X   32      3.2
26           Y   34      3.4
27           Z   40      4.0
28        UNKN    0      0.0
emillykkejensen commented 6 years ago

Makes sense to want to keep the order from the input. And speaking of input, I have added a smart (if I may say so myself) little extra argument 'freq_by' (feel free to rename it if you decide to use it), where you name the column you would like to compute the frequency by (the default is 'token').

txt_freq_DT <- function(x, freq_by = "token", exclude = c(NA, NaN), order = FALSE){

  x <- as.data.table(x)
  freq_by <- match.arg(freq_by, colnames(x))

  x <- x[, list(freq = .N), by = list("key" = get(freq_by))]
  x <- x[!(key %in% exclude)]
  x[, freq_pct := 100 * freq / sum(freq)]
  if(order) x <- x[order(-freq)]

  return(x)

}

If you set order to FALSE it will return the input order, factor or not, and you can get the frequency of any column: lemma, upos, paragraph_id, etc.

However, this changes the function into a more general udpipe frequency function rather than just a text frequency.
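To make the idea concrete, here is a self-contained usage example of the proposed function. The toy anno data.frame below is made up for illustration; in real use it would be the annotation data.frame produced by udpipe_annotate:

```r
library(data.table)

# The proposed txt_freq_DT from the comment above.
txt_freq_DT <- function(x, freq_by = "token", exclude = c(NA, NaN), order = FALSE){
  x <- as.data.table(x)
  freq_by <- match.arg(freq_by, colnames(x))
  x <- x[, list(freq = .N), by = list(key = get(freq_by))]
  x <- x[!(key %in% exclude)]
  x[, freq_pct := 100 * freq / sum(freq)]
  if(order) x <- x[order(-freq)]
  x
}

# Toy stand-in for the annotation data.frame from udpipe_annotate().
anno <- data.frame(token = c("The", "cat", "sat", "the", "mat"),
                   lemma = c("the", "cat", "sit", "the", "mat"),
                   upos  = c("DET", "NOUN", "VERB", "DET", "NOUN"),
                   stringsAsFactors = FALSE)

txt_freq_DT(anno, freq_by = "upos", order = TRUE)   # tag frequencies
txt_freq_DT(anno, freq_by = "lemma", order = TRUE)  # lemma frequencies
```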

jwijffels commented 6 years ago

I've updated the function to use the data.table speedup you proposed in case the vector is a character, and left the behaviour as-is if it is a factor. See commit cf6a8e8ccee44291a89b7d59c19d109907fd7c08. I'm not going to incorporate the proposed solution of allowing a data.frame as the first argument, as all the txt_ functions take a vector as their first argument. Just do txt_freq(x$token) or txt_freq(x$upos); there is no need to create a data.frame variant of that. Feel free to provide other feedback on speedups which you think are useful.
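The hybrid behaviour described above can be sketched as follows. This is an approximation of the idea, not the actual committed code; the function name txt_freq_hybrid is made up for illustration:

```r
library(data.table)

# Sketch of the hybrid approach: characters take the fast data.table
# path, factors keep the table() behaviour so that empty levels and
# the level order are preserved. (Approximation, not the real commit.)
txt_freq_hybrid <- function(x, exclude = c(NA, NaN), order = TRUE) {
  if (is.factor(x)) {
    tab <- table(x, exclude = exclude)
    result <- data.frame(key = names(tab), freq = as.integer(tab),
                         stringsAsFactors = FALSE)
    result$freq_pct <- 100 * result$freq / sum(result$freq)
    if (order) result <- result[order(result$freq, decreasing = TRUE), ]
    return(result)
  }
  dt <- data.table(key = x)
  dt <- dt[, list(freq = .N), by = key]
  dt <- dt[!(key %in% exclude)]
  dt[, freq_pct := 100 * freq / sum(freq)]
  if (order) dt <- dt[order(-freq)]
  dt
}
```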