Open kbenoit opened 1 year ago
@kbenoit Thank you for the suggestion.
There is a reason for counting from 1: by default, a number in the DFM is the sum of (1/proximity), and of course 1/0 is `Inf`.
One can either change the `weight_function` for `dfm()`, or change `count_from` for `tokens_proximity()`.
library(quanteda); library(quanteda.proximity)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(c(d1 = "a b c d e", d2 = "c d e"))
toksp <- tokens_proximity(toks, "b", count_from = 0)
toksp$proximity
#> $d1
#> [1] 1 0 1 2 3
#>
#> $d2
#> [1] 3 3 3
Created on 2023-11-17 with reprex v2.0.2
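To make the 1/proximity reasoning concrete, here is a minimal base-R sketch that does not call the package at all. The names `dist_from_target`, `inverse_weight` and `shifted_weight` are made up for illustration, and `shifted_weight` is only one possible alternative weighting, not something the package provides.

```r
# Distance of each token in "a b c d e" from the target "b" (position 2)
dist_from_target <- abs(seq_len(5) - 2L)   # 1 0 1 2 3

# Proximity under the two counting conventions discussed above
prox_from_1 <- dist_from_target + 1        # counting from 1: 2 1 2 3 4
prox_from_0 <- dist_from_target + 0        # counting from 0: 1 0 1 2 3

# Inverse weighting as described above (1/proximity)
inverse_weight <- function(x) 1 / x
w1 <- inverse_weight(prox_from_1)          # all weights are finite
w0 <- inverse_weight(prox_from_0)          # the target itself becomes Inf

# Hypothetical alternative: shift by one so counting from 0 stays finite
shifted_weight <- function(x) 1 / (x + 1)
w0_shifted <- shifted_weight(prox_from_0)

print(w1)
print(w0)
print(w0_shifted)
```

With the shifted weight, counting from 0 gives the same weights as counting from 1 with plain 1/proximity, which is why either knob (the weight function or `count_from`) can be used.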
When `get_min` (get the row minimum) is FALSE, it gives a matrix. (I realize now that the columns should be named and that the number of columns should be consistent across documents; I admit that I didn't pay enough attention to that in the development so far.) As explained in the documentation, `count_from` is not added to the numbers in the matrix.
library(quanteda); library(quanteda.proximity)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(c(d1 = "a b c d e", d2 = "c d e"))
toksp <- tokens_proximity(toks, pattern = "b|c", valuetype = "regex", get_min = FALSE)
toksp$proximity
#> $d1
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 0 1
#> [3,] 1 0
#> [4,] 2 1
#> [5,] 3 2
#>
#> $d2
#> [,1]
#> [1,] 0
#> [2,] 1
#> [3,] 2
Created on 2023-11-17 with reprex v2.0.2
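The relationship between the matrix and the row minimum can be sketched in base R as follows. This is an illustration of the idea under the behaviour described above (one column per matched token, `count_from` added only after taking the row minimum), not the package's actual implementation; all variable names are made up.

```r
# Tokens of d1 and the positions matched by the regex "b|c"
tokens_d1 <- c("a", "b", "c", "d", "e")
target_pos <- grep("b|c", tokens_d1)       # positions 2 and 3

# One column per match: absolute distance of every token to that match
dist_matrix <- abs(outer(seq_along(tokens_d1), target_pos, "-"))
print(dist_matrix)

# Taking the row minimum and then adding count_from (assumed to be 1 here)
count_from <- 1
row_min <- apply(dist_matrix, 1, min) + count_from
print(row_min)
```

The printed matrix has the same values as the `$d1` output above, and the row minimum collapses it to a single proximity per token.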
`get_min = FALSE`?
Right now, it's 1, and the token adjacent to it is 2. Seems like these should be 0 and 1.
And this could be interpreted as inconsistent if there are multiple matches, since adjacent tokens are now 1 from each other:
Created on 2023-11-17 with reprex v2.0.2