Closed by chainsawriot 1 year ago
require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
"A a B b" %>% tokens %>% tokens_proximity("b") %>% tokens_tolower() %>%
docvars("proximity") ## Not correct
#> $text1
#> [1] 4 3 2 1
"A a B b" %>% tokens %>% tokens_proximity("b") %>% tokens_tolower() %>%
tokens_proximity("b") %>% docvars("proximity") ## correct but need to do it every time
#> $text1
#> [1] 3 2 1 1
Created on 2023-11-17 with reprex v2.0.2
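To make the staleness problem concrete, here is a minimal base R sketch. It assumes that proximity is 1 plus the distance to the nearest exact, case-sensitive match of the keyword, which is consistent with the output above but is not taken from the package source. `proximity_sketch()` is a hypothetical helper, not a quanteda.proximity function, and returning NA when the keyword is absent is an arbitrary choice for this sketch.

```r
# Hypothetical helper, not part of quanteda.proximity.
# Assumption: proximity = 1 + distance (in tokens) to the nearest exact match.
proximity_sketch <- function(toks, keyword) {
  hits <- which(toks == keyword)
  if (length(hits) == 0L) {
    # Arbitrary choice for this sketch: no keyword means no defined proximity
    return(rep(NA_integer_, length(toks)))
  }
  vapply(seq_along(toks), function(i) min(abs(i - hits)) + 1L, integer(1))
}

toks <- c("A", "a", "B", "b")

# Computed once, before lowercasing: only the last token matches "b"
stale <- proximity_sketch(toks, "b")

# Recomputed after lowercasing: "B" has become "b" and now matches as well
fresh <- proximity_sketch(tolower(toks), "b")
```

Under that assumption, `stale` reproduces the first result (4 3 2 1) and `fresh` the second (3 2 1 1): the stored vector stops describing the tokens as soon as the tokens change.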
What does a 0 mean in a dfm when the token was not present in the original document? A dfm cannot contain NAs. See #34 for what a 0 could/should mean.
Also, there is a problem with the quanteda grammar if there is a single function, since a dfm_*() function must take a dfm as input and return a dfm.
The output of tokens_proximity() is not very useful analytically. Also, the S3 object is vulnerable to the many further steps (tokens_*()) one can apply after tokens_proximity(). The proximity vectors need to be recalculated whenever the tokens are changed, even by something as simple as changing the case.

An alternative would be to let the user do whatever tokens manipulation tasks they want after the creation of the tokens() object, without considering the proximity vectors. The proximity vectors would only come into play during the creation of the dfm, maybe with just one function, dfm_proximity(). This way, we don't need to keep track of the proximity vectors.

Of course, we could still provide a function to calculate the proximity vectors for the tokens object.
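The alternative could look roughly like the base R sketch below, where proximity is computed only at the moment the document-feature row is built, so nothing has to be kept in sync with later token changes. Everything here is an assumption made for illustration: `dfm_proximity_sketch()` is a hypothetical name, proximity is taken to be 1 plus the distance to the nearest exact match, and weighting each token by 1 / proximity is just one possible choice, not something decided in this thread.

```r
# Hypothetical sketch of "compute proximity only when building the dfm".
# Not the package's API; the 1 / proximity weighting is an assumption.
dfm_proximity_sketch <- function(toks, keyword, docname = "text1") {
  hits <- which(toks == keyword)
  stopifnot(length(hits) > 0L)  # this sketch does not handle a missing keyword
  # Assumed definition: 1 + distance to the nearest exact match
  prox <- vapply(seq_along(toks), function(i) min(abs(i - hits)) + 1L, integer(1))
  # Sum the weights per feature to get one row of a document-feature matrix
  weights <- tapply(1 / prox, toks, sum)
  matrix(as.vector(weights), nrow = 1,
         dimnames = list(docname, names(weights)))
}

# Any tokens manipulation happens first, with no proximity vector to invalidate
toks <- tolower(c("A", "a", "B", "b"))
dfm_row <- dfm_proximity_sketch(toks, "b")
```

A feature that never occurs in the document simply has no column in this sketch; in a real multi-document dfm it would be a 0, which is the ambiguity raised in #34.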