gesistsa / quanteda.proximity

📐 Proximity-based Weighting Scheme for the Quantitative Analysis of Textual Data
GNU General Public License v3.0

Consider using quanteda::index() #38

Open koheiw opened 11 months ago

koheiw commented 11 months ago

I suggest using index(), which can find the positions of keywords, including multi-word phrases.

library(quanteda.proximity)
library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.

txt <-
  c("Turkish President Tayyip Erdogan, in his strongest comments yet on the Gaza conflict, said on Wednesday the Palestinian militant group Hamas was not a terrorist organisation but a liberation group fighting to protect Palestinian lands.",
    "EU policymakers proposed the new agency in 2021 to stop financial firms from aiding criminals and terrorists. Brussels has so far relied on national regulators with no EU authority to stop money laundering and terrorist financing running into billions of euros.")

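toks <- tokens(txt)
len <- ntoken(toks)  # number of tokens per document
# locate the phrase: one row per match, with docname, from and to positions
idx <- index(toks, pattern = phrase("Tayyip Erdogan"))
# distance from every token to the nearer end of the match (single match assumed)
pmin(abs(seq_len(len[idx$docname]) - idx$from), abs(seq_len(len[idx$docname]) - idx$to))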
#>  [1]  2  1  0  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
#> [26] 22 23 24 25 26 27 28 29 30 31 32 33 34
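
The one-liner above relies on index() returning a single match. A minimal sketch of how the same idea might extend to several matches across several documents is shown below; min_dist() is a hypothetical helper written for illustration and is not part of quanteda or quanteda.proximity, and the toy texts are made up.

```r
library(quanteda)

# Toy corpus (illustrative only): the phrase occurs once in d1 and twice in d2
txt <- c(d1 = "a b c d e f", d2 = "x a b y a b z")
toks <- tokens(txt)
len <- ntoken(toks)
idx <- index(toks, pattern = phrase("a b"))

# Hypothetical helper: for a document of n tokens, the distance from each
# token to the nearest match, using the same formula as the one-liner above
min_dist <- function(n, from, to) {
  pos <- seq_len(n)
  if (length(from) == 0) return(rep(NA_integer_, n))  # no match in this document
  per_match <- Map(function(f, t) pmin(abs(pos - f), abs(pos - t)), from, to)
  Reduce(pmin, per_match)  # element-wise minimum over all matches
}

# Apply the helper document by document
dist <- lapply(names(len), function(doc) {
  hit <- idx[idx$docname == doc, ]
  min_dist(len[[doc]], hit$from, hit$to)
})
names(dist) <- names(len)
dist
```

Documents without any match get NA here; whether that or some maximum distance is the right default is a design choice for the package.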

More generally, pattern2fixed() can be used to parse patterns in the same way as quanteda does.

https://github.com/gesistsa/quanteda.proximity/blob/dbd414cc7d52d389105f2b8b997c1af912ead4f9/R/get_dist.R#L28-L39
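
As a rough illustration of that suggestion, the sketch below resolves glob patterns, including a multi-word one, against the vocabulary of a tokens object. It assumes pattern2fixed() accepts a list of character vectors together with the types to match against, as described in quanteda's documentation; the example sentence and patterns are made up.

```r
library(quanteda)

toks <- tokens("Turkish President Tayyip Erdogan spoke about terrorists and terrorism.")

# One multi-word pattern and one glob pattern, each as a character vector
pats <- list(c("Tayyip", "Erdogan"), "terror*")

# Expand the patterns into the concrete types found in the tokens
fixed <- pattern2fixed(pats, types = types(toks),
                       valuetype = "glob", case_insensitive = TRUE)
fixed
```

The result is a list of fixed token sequences, which could then be matched position by position instead of hand-written pattern handling.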

chainsawriot commented 11 months ago

Thank you very much for the suggestions @koheiw