gesistsa / quanteda.proximity

📐 Proximity-based Weighting Scheme for the Quantitative Analysis of Textual Data
GNU General Public License v3.0

Alternative idea: just one main function: `dfm_proximity` #33

Closed: chainsawriot closed this issue 1 year ago

chainsawriot commented 1 year ago

The output of tokens_proximity() is not very useful analytically. The S3 object is also fragile: any further tokens_*() step applied after tokens_proximity() can leave the proximity vectors out of date. They need to be recalculated whenever the tokens change, even for something as simple as changing the case.

An alternative would be to let the user do any token manipulation after creating the tokens() object, without considering the proximity vectors. The proximity vectors would only come into play when the dfm is created, perhaps through a single function, dfm_proximity(). This way, we don't need to keep track of the proximity vectors.

Of course, we could provide a function to calculate the proximity vectors for the tokens object.
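
A helper of that kind might look like the minimal base R sketch below. It assumes that proximity means the distance to the nearest exact match of the keyword, counted from 1, which is consistent with the reprex output in this thread. The name `proximity_vector()` and the `NA` result for a document without a match are hypothetical choices for illustration, not part of the package.

```r
# Hypothetical helper, not part of quanteda.proximity's API.
# Assumption: proximity = distance to the nearest keyword match, counted from 1.
proximity_vector <- function(toks, keyword) {
  hits <- which(toks == keyword)  # positions of exact, case-sensitive matches
  if (length(hits) == 0L) {
    # Assumed convention for this sketch only: no match gives NA
    return(rep(NA_integer_, length(toks)))
  }
  vapply(seq_along(toks), function(i) min(abs(i - hits)) + 1L, integer(1))
}

toks <- c("A", "a", "B", "b")
proximity_vector(toks, "b")           # only the last token matches
proximity_vector(tolower(toks), "b")  # after lowercasing, two tokens match
```

Because the helper works on the tokens as they currently are, it can be called at any point and never returns values computed from an earlier state of the tokens.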

chainsawriot commented 1 year ago

15

27

chainsawriot commented 1 year ago
require(quanteda); require(quanteda.proximity)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> Loading required package: quanteda.proximity
"A a B b" %>% tokens %>% tokens_proximity("b") %>% tokens_tolower() %>%
    docvars("proximity") ## Not correct: computed before lowercasing, so "B" is not treated as a match
#> $text1
#> [1] 4 3 2 1
"A a B b" %>% tokens %>% tokens_proximity("b") %>% tokens_tolower() %>%
    tokens_proximity("b")  %>% docvars("proximity") ## Correct, but has to be redone after every change
#> $text1
#> [1] 3 2 1 1

Created on 2023-11-17 with reprex v2.0.2
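
The proposal in this issue avoids the stale vector shown above by computing proximity only when the counts are built. The base R sketch below illustrates that ordering on a single document. `weighted_counts()` is a hypothetical stand-in for the proposed dfm_proximity(), and the inverse-proximity weight `1 / p` is an assumption made for illustration.

```r
# Hypothetical sketch in base R; weighted_counts() is not a quanteda function.
# Assumptions: proximity is counted from 1 and each token is weighted by 1 / proximity.
weighted_counts <- function(toks, keyword, weight = function(p) 1 / p) {
  hits <- which(toks == keyword)
  stopifnot(length(hits) > 0L)  # this sketch does not handle documents without a match
  prox <- vapply(seq_along(toks), function(i) min(abs(i - hits)) + 1L, integer(1))
  tapply(weight(prox), toks, sum)  # sum the weights per token type
}

toks <- tolower(c("A", "a", "B", "b"))  # all token manipulation happens first
res <- weighted_counts(toks, "b")       # proximity is derived from the final tokens
res                                     # a = 1/3 + 1/2, b = 1 + 1
```

Since the proximity values exist only inside the function, there is no stored vector for a later tokens_*() call to invalidate.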

kbenoit commented 1 year ago

What does a 0 mean in a dfm when the token was not present in the original document? A dfm cannot contain NAs. See #34 for what a 0 could or should mean.

kbenoit commented 1 year ago

There is also a problem with the quanteda grammar if there is a single function, since a dfm_*() function must take a dfm as input and return a dfm.