gesistsa / quanteda.proximity

📐 Proximity-based Weighting Scheme for the Quantitative Analysis of Textual Data
GNU General Public License v3.0

Dependency proximity #52

Open chainsawriot opened 10 months ago

chainsawriot commented 10 months ago
require(udpipe)
#> Loading required package: udpipe
require(textplot)
#> Loading required package: textplot
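## Download the model once if it is not available locally:
## m_eng_ewt <- udpipe_download_model(language = "english-ewt", model_dir = "~/dev/misc")
## Change this path to wherever the model file is stored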
m_eng_ewt_path <- "~/dev/misc/english-ewt-ud-2.5-191206.udpipe"
m_eng_ewt_loaded <- udpipe::udpipe_load_model(file = m_eng_ewt_path)

sentence <- udpipe::udpipe_annotate(m_eng_ewt_loaded, x = "Turkish President Tayyip Erdogan, in his strongest comments yet on the Gaza conflict, said on Wednesday the Palestinian militant group Hamas was not a terrorist organisation but a liberation group fighting to protect Palestinian lands.") |> as.data.frame()
textplot::textplot_dependencyparser(sentence)
#> Loading required namespace: ggraph

Created on 2023-11-22 with reprex v2.0.2

chainsawriot commented 10 months ago

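Use igraph to calculate the syntactic distance. (UPDATE: the result below should be incorrect; check e.g. distances(graph, mode = "all")[, "ROOT"]. graph_from_data_frame() orders vertices by first appearance in the edge list, not by token_id, so the names assigned via V(graph)$name do not line up with the tokens. A visible symptom: the output reports a distance of 2 between "terrorist" and "organisation", although the two are directly linked in the parse.)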

require(textplot)
#> Loading required package: textplot
require(igraph)
#> Loading required package: igraph
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union

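## Download the model once if it is not available locally:
## m_eng_ewt <- udpipe_download_model(language = "english-ewt", model_dir = "~/dev/misc")
## Change this path to wherever the model file is stored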
m_eng_ewt_path <- "~/dev/misc/english-ewt-ud-2.5-191206.udpipe"
m_eng_ewt_loaded <- udpipe::udpipe_load_model(file = m_eng_ewt_path)

sentence <- udpipe::udpipe_annotate(m_eng_ewt_loaded, x = "Turkish President Tayyip Erdogan, in his strongest comments yet on the Gaza conflict, said on Wednesday the Palestinian militant group Hamas was not a terrorist organisation but a liberation group fighting to protect Palestinian lands.") |> as.data.frame()
textplot::textplot_dependencyparser(sentence)
#> Loading required namespace: ggraph


sentence[,c("token_id", "head_token_id", "token", "dep_rel")]
#>    token_id head_token_id        token   dep_rel
#> 1         1             2      Turkish      amod
#> 2         2            16    President     nsubj
#> 3         3             2       Tayyip      flat
#> 4         4             2      Erdogan      flat
#> 5         5             2            ,     punct
#> 6         6             9           in      case
#> 7         7             9          his nmod:poss
#> 8         8             9    strongest      amod
#> 9         9             2     comments      nmod
#> 10       10             9          yet    advmod
#> 11       11            14           on      case
#> 12       12            14          the       det
#> 13       13            14         Gaza  compound
#> 14       14             9     conflict      nmod
#> 15       15            16            ,     punct
#> 16       16             0         said      root
#> 17       17            18           on      case
#> 18       18            16    Wednesday       obl
#> 19       19            22          the       det
#> 20       20            22  Palestinian      amod
#> 21       21            22     militant      amod
#> 22       22            28        group     nsubj
#> 23       23            22        Hamas     appos
#> 24       24            28          was       cop
#> 25       25            28          not    advmod
#> 26       26            28            a       det
#> 27       27            28    terrorist  compound
#> 28       28            18 organisation      flat
#> 29       29            32          but        cc
#> 30       30            32            a       det
#> 31       31            32   liberation  compound
#> 32       32            18        group      conj
#> 33       33            32     fighting       acl
#> 34       34            35           to      mark
#> 35       35            33      protect     xcomp
#> 36       36            37  Palestinian      amod
#> 37       37            35        lands       obj
#> 38       38            16            .     punct

graph <- graph_from_data_frame(sentence[,c("head_token_id", "token_id")])
V(graph)$name <- c("ROOT", sentence$token)
distances(graph, mode = "all")[, "terrorist"]
#>         ROOT      Turkish    President       Tayyip      Erdogan            , 
#>            5            4            6            7            5            3 
#>           in          his    strongest     comments          yet           on 
#>            1            2            4            6            5            7 
#>          the         Gaza     conflict            ,         said           on 
#>            6            6            6            6            7            7 
#>    Wednesday          the  Palestinian     militant        group        Hamas 
#>            7            7            8            8            8            5 
#>          was          not            a    terrorist organisation          but 
#>            4            2            2            0            2            3 
#>            a   liberation        group     fighting           to      protect 
#>            3            3            3            5            5            5 
#>  Palestinian        lands            . 
#>            7            8            5

Created on 2023-11-22 with reprex v2.0.2
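
One possible way to keep vertices and tokens aligned, as a minimal sketch rather than a settled implementation: pass an explicit `vertices` table to `graph_from_data_frame()` so that vertex names are the token ids, and keep the token text as a separate vertex attribute. To stay self-contained, the `sentence` data frame here is retyped from the parse table printed above instead of being produced by udpipe; the `vertices`, `target_id` and `dep_distance` names are illustrative only.

```r
library(igraph)

# Parse retyped from the udpipe output above (ids kept as character,
# as udpipe returns them).
sentence <- data.frame(
  token_id = as.character(1:38),
  head_token_id = as.character(c(
    2, 16, 2, 2, 2, 9, 9, 9, 2, 9, 14, 14, 14, 9, 16, 0, 18, 16, 22,
    22, 22, 28, 22, 28, 28, 28, 28, 18, 32, 32, 32, 18, 32, 35, 33,
    37, 35, 16
  )),
  token = c(
    "Turkish", "President", "Tayyip", "Erdogan", ",", "in", "his",
    "strongest", "comments", "yet", "on", "the", "Gaza", "conflict", ",",
    "said", "on", "Wednesday", "the", "Palestinian", "militant", "group",
    "Hamas", "was", "not", "a", "terrorist", "organisation", "but", "a",
    "liberation", "group", "fighting", "to", "protect", "Palestinian",
    "lands", "."
  )
)

# Explicit vertex table: the first column fixes both the vertex names and
# their order; "0" is the artificial root used by head_token_id.
vertices <- data.frame(
  name = c("0", sentence$token_id),
  token = c("ROOT", sentence$token)
)

graph <- graph_from_data_frame(
  sentence[, c("head_token_id", "token_id")],
  directed = FALSE,
  vertices = vertices
)

# Look the target up by token id, because token text is not unique
# ("on", "the", "group", ... occur more than once).
target_id <- sentence$token_id[sentence$token == "terrorist"]
dep_distance <- distances(graph)[, target_id]
names(dep_distance) <- V(graph)$token

dep_distance[c("ROOT", "said", "organisation", "terrorist")]
# ROOT = 4, said = 3, organisation = 1, terrorist = 0
```

With this construction "organisation" is one step from "terrorist" and ROOT is four steps away (terrorist, organisation, Wednesday, said, ROOT), which matches the head_token_id column. Indexing the result by token text remains ambiguous for repeated tokens, so downstream code would probably be safer working with token ids.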

chainsawriot commented 10 months ago

Following the terminology in https://arxiv.org/pdf/1909.10171.pdf, maybe this should be called "dependency proximity".