Use quanteda::index() ref #38

chainsawriot commented 10 months ago

This is 20x slower than #26 (timings recorded in #20 by @schochastics).

Several possibilities:

[x] Do we need to build an index for every document?
[x] Where are the bottlenecks?
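
On the first question, one way to avoid indexing each document separately is to collect all match positions in a single pass and then group them by document. The base R sketch below is only an illustration under assumptions, not what the package does: `idx` is a hypothetical stand-in for the data frame returned by `quanteda::index()` (the column names `docname`, `from`, `to` are assumed), `ntok` holds made-up document lengths, and `nearest_dist()` is a hypothetical helper. It only uses the start position `from`, so multi-token phrases are not handled.

```r
# Hypothetical stand-in for the output of quanteda::index();
# the column names (docname, from, to) are an assumption.
idx <- data.frame(
  docname = c("doc1", "doc1", "doc2"),
  from    = c(2L, 5L, 1L),
  to      = c(2L, 5L, 1L)
)

# Hypothetical document lengths (number of tokens per document).
ntok <- c(doc1 = 6L, doc2 = 3L, doc3 = 2L)

# Group the match positions by document once,
# instead of building an index document by document.
hits <- split(idx$from, idx$docname)

# Distance from every token position to its nearest match.
# Documents without any match get Inf for all positions.
nearest_dist <- function(n, pos) {
  if (length(pos) == 0L) {
    return(rep(Inf, n))
  }
  vapply(
    seq_len(n),
    function(i) as.numeric(min(abs(i - pos))),
    numeric(1)
  )
}

# hits[[d]] is NULL for documents that never appear in idx (here: doc3).
dist <- lapply(names(ntok), function(d) nearest_dist(ntok[[d]], hits[[d]]))
names(dist) <- names(ntok)
```

Whether the raw distance or an offset version of it is the right output is a separate design choice; the point here is only that the index is built once.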

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"   398ms  428ms      2.34     167MB     18.7

Created on 2023-11-21 with reprex v2.0.2

chainsawriot commented 10 months ago

Damn.

https://github.com/gesistsa/quanteda.proximity/blob/bf530b47130011249559d46b8a217aae00f7fe80/R/get_dist.R#L12

chainsawriot commented 10 months ago

051869f is 3x slower

require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  94.7ms  107ms      7.65     159MB     44.0

Created on 2023-11-21 with reprex v2.0.2

chainsawriot commented 10 months ago

789d1fb: 2x slower

Given that this introduces more functionality (phrases etc.), I think this should be enough, although further optimization is certainly possible.

require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  53.1ms 63.2ms      13.1    98.9MB     56.0
bench::mark(quanteda::index(toks, c("a")))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "quanteda::index(toks, c(\"a\"))"   5.22ms 5.59ms      175.    2.22MB     13.3

Created on 2023-11-21 with reprex v2.0.2
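
For a rough sense of scale, the median timings reported in the three `bench::mark()` runs in this thread can be compared directly. This is plain arithmetic on the numbers above, not a new measurement. The label `initial` is just a name chosen here for the first run, and the `quanteda::index()` timing comes from the last run only, so the comparison across runs is approximate.

```r
# Median timings in milliseconds, copied from the bench::mark() output above.
# "initial" is a label chosen here for the first benchmark in the thread.
median_ms <- c(initial = 428, `051869f` = 107, `789d1fb` = 63.2)

# Median timing of the bare quanteda::index() call, from the last run.
index_ms <- 5.59

# How much faster each version is than the first benchmark.
speedup <- median_ms[["initial"]] / median_ms

# How many times slower each version is than the bare index() call.
overhead <- median_ms / index_ms

print(round(speedup, 1))
print(round(overhead, 1))
```

On these numbers, the last commit is still roughly an order of magnitude slower than the bare `quanteda::index()` call, which would be consistent with the remark that further optimization is possible.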

gesistsa / quanteda.proximity

Use quanteda::index() ref #38 #44