Should a token's proximity to itself be 1 or 0?

kbenoit commented 1 year ago

Right now, it's 1, and the token adjacent to it is 2. Seems like these should be 0 and 1.

library("quanteda")
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
library("quanteda.proximity")
toks <- tokens(c(d1 = "a b c d e", d2 = "c d e"))
toksp <- tokens_proximity(toks, "b")
toksp$proximity
#> $d1
#> [1] 2 1 2 3 4
#> 
#> $d2
#> [1] 4 4 4

And this could be interpreted as inconsistent if there are multiple matches, since adjacent tokens are now 1 from each other:

tokens_proximity(toks, pattern = "b|c", valuetype = "regex")$proximity
#> $d1
#> [1] 2 1 1 2 3
#> 
#> $d2
#> [1] 1 2 3

Created on 2023-11-17 with reprex v2.0.2

chainsawriot commented 1 year ago

@kbenoit Thank you for the suggestion. Counting starts from 1 for a reason: by default, each value in the DFM is the sum of 1/proximity, and 1/0 is Inf.

One can either change the weight_function for dfm(), or change count_from for tokens_proximity().

library(quanteda); library(quanteda.proximity)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(c(d1 = "a b c d e", d2 = "c d e"))
toksp <- tokens_proximity(toks, "b", count_from = 0)
toksp$proximity
#> $d1
#> [1] 1 0 1 2 3
#> 
#> $d2
#> [1] 3 3 3

Created on 2023-11-17 with reprex v2.0.2
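
To make the 1/proximity point concrete, here is a minimal base R sketch of the arithmetic only. It does not call the package: the two proximity vectors are copied from the d1 outputs in this thread, and `inverse_weight()` and `shifted_weight()` are hypothetical names used for illustration, not functions from quanteda.proximity.

```r
# Proximity vectors for d1 as shown in this thread (target "b").
prox_from_1 <- c(2, 1, 2, 3, 4)  # count_from = 1 (the default)
prox_from_0 <- c(1, 0, 1, 2, 3)  # count_from = 0

# Inverse weighting as described above: weight = 1 / proximity.
# (Hypothetical helper name, for illustration only.)
inverse_weight <- function(x) 1 / x

w1 <- inverse_weight(prox_from_1)  # every weight is finite
w0 <- inverse_weight(prox_from_0)  # the target itself gets 1 / 0 = Inf

# Hypothetical alternative weight (an assumption, not package code):
# shift by one before inverting so zero-based proximities stay finite.
shifted_weight <- function(x) 1 / (x + 1)
w0_shifted <- shifted_weight(prox_from_0)

print(w1)
print(w0)
print(w0_shifted)
```

Under these assumptions, a zero-based proximity combined with a shifted weight gives the same weights as the one-based default, which is why changing count_from and changing the weight function are two routes to the same result.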

When get_min (take the row minimum) is FALSE, the result is a matrix. (I realize now that the columns should be named and that the number of columns should be consistent; I admit that I didn't pay enough attention to that during development so far.) As explained in the documentation, count_from is not added to the numbers in the matrix.

library(quanteda); library(quanteda.proximity)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(c(d1 = "a b c d e", d2 = "c d e"))
toksp <- tokens_proximity(toks, pattern = "b|c", valuetype = "regex", get_min = FALSE)
toksp$proximity
#> $d1
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    0    1
#> [3,]    1    0
#> [4,]    2    1
#> [5,]    3    2
#> 
#> $d2
#>      [,1]
#> [1,]    0
#> [2,]    1
#> [3,]    2

Created on 2023-11-17 with reprex v2.0.2
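
How the matrix relates to the default vector output can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: `proximity_matrix()` and `proximity_min()` are hypothetical helpers, documents are plain character vectors, and the no-match case (such as d2 with the pattern "b") is not handled.

```r
# Hypothetical helpers for illustration; not functions from quanteda.proximity.

# One column per matching token; each cell is the absolute distance
# (in token positions) from a token to that match. No count_from is added.
proximity_matrix <- function(tokens, pattern) {
  targets <- grep(pattern, tokens)  # positions of tokens matching the regex
  outer(seq_along(tokens), targets, function(i, j) abs(i - j))
}

# Row minimum of the matrix, then shifted by count_from.
# Assumes at least one token matches the pattern.
proximity_min <- function(tokens, pattern, count_from = 1) {
  m <- proximity_matrix(tokens, pattern)
  apply(m, 1, min) + count_from
}

d1 <- c("a", "b", "c", "d", "e")
d2 <- c("c", "d", "e")

m1 <- proximity_matrix(d1, "b|c")
p1 <- proximity_min(d1, "b|c")                  # count_from = 1
p0 <- proximity_min(d1, "b|c", count_from = 0)  # count_from = 0
p2 <- proximity_min(d2, "b|c")

print(m1)
print(p1)
print(p0)
print(p2)
```

In this sketch the matrix is always zero-based, and count_from only enters after the row minimum is taken, which matches the behaviour described above for the d1 and d2 outputs with the "b|c" pattern.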

chainsawriot commented 1 year ago

[ ] Name the columns in the matrix, when get_min is FALSE
[ ] Make the number of columns consistent?
[ ] Do we actually need get_min = FALSE?

gesistsa / quanteda.proximity, issue #34