gesistsa / quanteda.proximity

📐 Proximity-based Weighting Scheme for the Quantitative Analysis of Textual Data
GNU General Public License v3.0

Optimization #20

Open chainsawriot opened 8 months ago

chainsawriot commented 8 months ago
require(quanteda)
require(quanteda.proximity)
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("war")))

Takes almost 4s for only 59 documents.

chainsawriot commented 8 months ago

To my surprise, the slowest part is .get_min (not .cal_dist):

https://github.com/gesistsa/quanteda.proximity/blob/34377dc140a02dd473e7be53f0f249c676d198ce/R/get_dist.R#L17
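
For orientation, here is a purely illustrative reconstruction of the two helpers (the real definitions live in R/get_dist.R at the link above; the signatures are only inferred from the calls quoted in the next comment):

# Hypothetical stand-ins, NOT the package source, assuming poss is
# simply every token position in a document (seq_along of the tokens)
# .cal_dist: distance from each position in poss to one keyword occurrence
.cal_dist <- function(target, poss) abs(poss - target)
# .get_min: for one position, the minimum distance across all occurrences.
# Mapped over every token position this is an R-level double loop,
# which would explain why it dominates the profile.
.get_min <- function(pos, x) min(vapply(x, function(d) d[pos], numeric(1)))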

schochastics commented 8 months ago

(Disclaimer: Not familiar with quanteda)

It also seems to depend on the keyword used?

R> bench::mark(
       war = tokens_proximity(toks, c("war")),
       senate = tokens_proximity(toks, c("Senate")), check = FALSE
   )
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
1 war           2.63s    2.63s     0.380    6.96MB     12.9     1    34      2.63s <NULL> <Rprofmem>
2 senate     594.03ms 594.03ms     1.68     4.22MB     11.8     1     7   594.03ms <NULL> <Rprofmem>
# ℹ 2 more variables: time <list>, gc <list>

schochastics commented 8 months ago

@chainsawriot A suggestion for a speedup (a self-contained toy setup to reproduce this follows the benchmark output below):

# current version
f <- function() {
    res <- lapply(target_idx, .cal_dist, poss = poss)
    purrr::map_dbl(poss, .get_min, x = res) + count_from
}
# new version: sapply() simplifies the result to a matrix, so one min()
# per row via apply() replaces the per-position purrr loop
g <- function() {
    res <- sapply(target_idx, .cal_dist, poss = poss)
    apply(res, 1, min) + count_from
}

bench::mark(
    f(),
    g(), iterations = 100
)
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
1 f()         62.71ms  64.06ms      15.6    48.2KB     9.57    62    38      3.97s <dbl>  <Rprofmem>
2 g()          1.07ms   1.15ms     852.    132.5KB    17.4     98     2   114.99ms <dbl>  <Rprofmem>
# ℹ 2 more variables: time <list>, gc <list>
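
(To reproduce this comparison outside the package, a toy setup along the lines of the sketch earlier in the thread works; target_idx, poss and count_from below are stand-ins for the real per-document values:)

# Toy inputs, assuming poss is every token position in one document
poss <- seq_len(1000)           # token positions
target_idx <- c(10, 500, 900)   # positions where the keyword occurs
count_from <- 1
.cal_dist <- function(target, poss) abs(poss - target)
.get_min <- function(pos, x) min(vapply(x, function(d) d[pos], numeric(1)))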

Happy to make a PR if this would work for you.

chainsawriot commented 8 months ago

@schochastics Yes, because the proximity calculation is skipped when the keyword is not found in a text. Try a word that occurs in all documents, e.g. "a". It takes 12s on my machine.
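
The worst case is therefore easy to trigger (timing will vary by machine; this is the same benchmark run later in the thread):

# "a" occurs in every inaugural address, so no document can be skipped
bench::mark(tokens_proximity(toks, c("a")))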

schochastics commented 8 months ago

With #22 and #23, we have probably reached what is possible:

require(quanteda)
require(quanteda.proximity)

toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"    94ms  106ms      7.28     150MB     47.3

Rprof()
tmp <- tokens_proximity(toks, c("a"))
Rprof(NULL)
summaryRprof()
#> $by.self
#>               self.time self.pct total.time total.pct
#> "<Anonymous>"      0.02    33.33       0.06    100.00
#> "as.vector"        0.02    33.33       0.02     33.33
#> "FUN"              0.02    33.33       0.02     33.33
#>
#> $by.total
#>                           total.time total.pct self.time self.pct
#> "<Anonymous>"                   0.06    100.00      0.02    33.33
#> ".f"                            0.06    100.00      0.00     0.00
#> "block_exec"                    0.06    100.00      0.00     0.00
#> "call_block"                    0.06    100.00      0.00     0.00
#> "call_with_cleanup"             0.06    100.00      0.00     0.00
#> "do.call"                       0.06    100.00      0.00     0.00
#> "doTryCatch"                    0.06    100.00      0.00     0.00
#> "eng_r"                         0.06    100.00      0.00     0.00
#> "eval_with_user_handlers"       0.06    100.00      0.00     0.00
#> "eval"                          0.06    100.00      0.00     0.00
#> "evaluate_call"                 0.06    100.00      0.00     0.00
#> "evaluate::evaluate"            0.06    100.00      0.00     0.00
#> "evaluate"                      0.06    100.00      0.00     0.00
#> "get_proximity"                 0.06    100.00      0.00     0.00
#> "handle_error"                  0.06    100.00      0.00     0.00
#> "handle"                        0.06    100.00      0.00     0.00
#> "in_dir"                        0.06    100.00      0.00     0.00
#> "in_input_dir"                  0.06    100.00      0.00     0.00
#> "knitr::knit"                   0.06    100.00      0.00     0.00
#> "map_"                          0.06    100.00      0.00     0.00
#> "process_file"                  0.06    100.00      0.00     0.00
#> "process_group.block"           0.06    100.00      0.00     0.00
#> "process_group"                 0.06    100.00      0.00     0.00
#> "purrr::map"                    0.06    100.00      0.00     0.00
#> "rmarkdown::render"             0.06    100.00      0.00     0.00
#> "saveRDS"                       0.06    100.00      0.00     0.00
#> "timing_fn"                     0.06    100.00      0.00     0.00
#> "tokens_proximity"              0.06    100.00      0.00     0.00
#> "try"                           0.06    100.00      0.00     0.00
#> "tryCatch"                      0.06    100.00      0.00     0.00
#> "tryCatchList"                  0.06    100.00      0.00     0.00
#> "tryCatchOne"                   0.06    100.00      0.00     0.00
#> "with_indexed_errors"           0.06    100.00      0.00     0.00
#> "withCallingHandlers"           0.06    100.00      0.00     0.00
#> "withVisible"                   0.06    100.00      0.00     0.00
#> "as.vector"                     0.02     33.33      0.02    33.33
#> "FUN"                           0.02     33.33      0.02    33.33
#> "as.data.frame.matrix"          0.02     33.33      0.00     0.00
#> "as.data.frame"                 0.02     33.33      0.00     0.00
#> "lapply"                        0.02     33.33      0.00     0.00
#> "sapply"                        0.02     33.33      0.00     0.00
#>
#> $sample.interval
#> [1] 0.02
#>
#> $sampling.time
#> [1] 0.06

Created on 2023-11-16 with reprex v2.0.2

schochastics commented 8 months ago

OK, this is a fun exercise. Here is the most optimized C++ I could come up with:

#include <Rcpp.h>
#include <algorithm>
// [[Rcpp::export]]
Rcpp::NumericVector rowMins_optim(const Rcpp::NumericMatrix& mat) {
    int nRows = mat.nrow();
    int nCols = mat.ncol();
    Rcpp::NumericVector mins(nRows);

    for(int i = 0; i < nRows; ++i) {
        double minValue = mat(i, 0);

        for(int j = 1; j < nCols; ++j) {
            minValue = std::min(minValue, mat(i, j));
        }

        mins[i] = minValue;
    }
    return mins;
}

m <- matrix(sample(1:20, 50000, replace = TRUE), ncol = 5)
res <- bench::mark(
    apply = apply(m, 1, min),
    docall = do.call(pmin, as.data.frame(m)),
    Rcpp_optim = rowMins_optim(m)
)
res
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 apply        4.75ms   4.88ms      203.     391KB     55.7
#> 2 docall     184.79µs 190.39µs     5024.     511KB     41.4
#> 3 Rcpp_optim  36.49µs  40.31µs    24519.     471KB    221.

That would be quite a speedup, but I am not sure it justifies making the pkg a compiled one?
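
(A possible middle ground, assuming an extra dependency is acceptable: matrixStats ships a precompiled rowMins(), which would keep this package pure R. A quick check, with no timings claimed here:)

# matrixStats::rowMins() is compiled, so no Rcpp needed in this package
m <- matrix(sample(1:20, 50000, replace = TRUE), ncol = 5)
bench::mark(
    docall      = do.call(pmin, as.data.frame(m)),
    matrixStats = matrixStats::rowMins(m)
)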

chainsawriot commented 8 months ago

@schochastics Let's make the pkg a compiled one. It seems inevitable these days.

(I am cracking dfm() and I think for that one the only viable solution is C++.)

schochastics commented 8 months ago

OK, I can prepare a PR making the necessary changes.

chainsawriot commented 8 months ago

Another bottleneck: the memory footprint is also kind of unacceptable.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
data_corpus_inaugural %>% tokens %>% tokens_tolower %>% tokens_compound(data_dictionary_LSD2015, concatenator = " ") -> toks
toks %>% tokens_proximity(c("ameri*")) -> tokp
bench::mark(dfm(tokp), dfm(toks), check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dfm(tokp)   557.4ms  557.4ms      1.79    1.11GB     46.6
#> 2 dfm(toks)    13.3ms   13.9ms     49.9     11.1MB     16.0

Created on 2023-11-16 with reprex v2.0.2

Arguably, dfm(tokp) is a slightly more difficult task than dfm(toks), and the original dfm() is multi-threaded (with RcppParallel (3.x) or TBB (4.x) in C++ code). But the difference is too great.
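
One way to locate where the allocations happen, mirroring the Rprof run earlier in the thread (figures will differ by machine):

# Profile dfm() on the proximity tokens with memory profiling enabled
Rprof(memory.profiling = TRUE)
tmp <- dfm(tokp)
Rprof(NULL)
summaryRprof(memory = "both")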

schochastics commented 8 months ago

With #26:

require(quanteda)
require(quanteda.proximity)
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  32.8ms 35.4ms      23.5    91.6MB     80.2

Created on 2023-11-17 with reprex v2.0.2

Also, for future reference, see my blog and a discussion on Mastodon.

chainsawriot commented 7 months ago

With #44:

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  56.9ms 58.8ms      12.9    98.9MB     47.9

Created on 2023-11-21 with reprex v2.0.2

The running time is about 2x compared with #26, but I think there is room for optimization.

https://github.com/gesistsa/quanteda.proximity/blob/62726674a816a835d3a864fdb9f6ac9df39adce0/R/get_dist.R#L25-L38

These lines are pure R.