chainsawriot / rectr

💒 Reproducible Extraction of Cross-lingual Topics using R
GNU Lesser General Public License v2.1

Obtaining fine-grained topics? #2

Closed justinchuntingho closed 3 years ago

justinchuntingho commented 4 years ago

First of all, thanks for the groundbreaking work!

I was trying to get more fine-grained topics with a higher k, both on the wiki corpus and on a corpus of my own (around 60,000 news articles). However, even though I set a higher k, I constantly get the following warning message:

Warning message:
In calculate_gmm(wiki_dfm_filtered, seed = 46709394) :
  Cannot converge with a model with k = 20.  Actual k = 3

I am not sure if the low number of topics is due to the dimension reduction in step 3 (filter_dfm()) or to the GMM algorithm. I did try changing the multiplication_factor argument to retain more dimensions, but the result is no better.

Here's a reproducible example:

library("rectr")

wiki_corpus <- create_corpus(wiki$content, wiki$lang)
wiki_dfm <- transform_dfm_boe(wiki_corpus, noise = TRUE)
wiki_dfm

wiki_dfm_filtered <- filter_dfm(wiki_dfm, k = 20, multiplication_factor = 2)
wiki_dfm_filtered

wiki_gmm <- calculate_gmm(wiki_dfm_filtered, seed = 46709394)
wiki_gmm

Here's what I get:

20-topic rectr model trained with a dfm with a dimension of 342 x 41 and de/en language(s).
Filtered with k =  20
Aligned word embeddings: bert
Defacto k = 3

sessionInfo:

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rectr_0.1.3

loaded via a namespace (and not attached):
 [1] reticulate_1.16    modeltools_0.2-23  tidyselect_1.1.0   remotes_2.2.0     
 [5] purrr_0.3.4        lattice_0.20-41    colorspace_1.4-1   vctrs_0.3.4       
 [9] generics_0.0.2     testthat_2.3.2     stats4_4.0.2       usethis_1.6.3     
[13] SnowballC_0.7.0    yaml_2.2.1         rlang_0.4.7        pkgbuild_1.1.0    
[17] pillar_1.4.6       glue_1.4.2         withr_2.3.0        sessioninfo_1.1.1 
[21] lifecycle_0.2.0    munsell_0.5.0      gtable_0.3.0       mvtnorm_1.1-1     
[25] devtools_2.3.2     memoise_1.1.0      callr_3.4.4        ps_1.3.4          
[29] flexmix_2.3-15     fansi_0.4.1        tokenizers_0.2.1   Rcpp_1.0.5        
[33] scales_1.1.1       backports_1.1.10   desc_1.2.0         pkgload_1.1.0     
[37] RcppParallel_5.0.2 jsonlite_1.7.1     RSpectra_0.16-0    fs_1.5.0          
[41] fastmatch_1.1-0    stopwords_2.0      ggplot2_3.3.2      digest_0.6.25     
[45] stringi_1.5.3      processx_3.4.4     dplyr_1.0.2        quanteda_2.1.1    
[49] grid_4.0.2         rprojroot_1.3-2    cli_2.0.2          tools_4.0.2       
[53] magrittr_1.5       tibble_3.0.3       crayon_1.3.4       pkgconfig_2.0.3   
[57] ellipsis_0.3.1     Matrix_1.2-18      data.table_1.13.0  prettyunits_1.1.1 
[61] assertthat_0.2.1   rstudioapi_0.11    R6_2.4.1           nnet_7.3-14       
[65] compiler_4.0.2  

I got similar results when I tried with my own corpus.

On a remotely related note, I am wondering if it would be possible to export the filtered dfm for other algorithms that might produce finer-grained topics, for example Gaussian LDA, an adaptation of LDA that takes word embeddings as input (here's a Python implementation: https://pypi.org/project/gaussianlda/; I can't find one in R).

Once again, thanks a lot for the great work!

chainsawriot commented 4 years ago

Make sure your corpus actually has many cross-lingual topics (i.e. topics that exist in all languages). I am pretty sure the wiki corpus does not have 20. I don't know about your corpora.

In the paper, I wrote that it is possible to converge to a solution with fewer topics than your desired k when your corpus doesn't have enough variance to support a solution with a large k. GMM is pretty restrictive. You can't get fine-grained topics if your input is not fine-grained enough (as always, GIGO: garbage in, garbage out).
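This kind of pruning is easy to reproduce outside rectr. Here is a minimal standalone sketch using flexmix (which appears in the loaded namespaces of your sessionInfo; this illustrates the general GMM behaviour, not rectr's exact internals): EM drops mixture components whose prior weight falls below flexmix's minprior threshold, so the fitted k ends up below the requested one.

library("flexmix")

set.seed(46709394)
## Nclus ships with flexmix: 550 points drawn from 4 bivariate normal clusters
data("Nclus")

## Ask for far more components than the data supports ...
fit <- flexmix(Nclus ~ 1, k = 20, model = FLXMCmvnorm())

## ... and EM prunes components whose prior falls below the minprior
## threshold (default 0.05), so the de facto k is well below 20
fit@k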

You can access the actual matrix of a filtered dfm object by adding $dfm after it (e.g. filtered_dfm$dfm). You can then use it anywhere.

https://github.com/chainsawriot/rectr/blob/649ac80be9cbf86c1f65b3179bcb6aef1e0d9060/R/rectr.R#L182
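For instance, to dump it to disk for the Python gaussianlda package you mentioned (a minimal sketch building on your reproducible example above; the file name is just an illustration):

## $dfm is the document-by-dimension matrix (342 x 41 in the example above)
emb <- wiki_dfm_filtered$dfm
dim(emb)

## Write it out for use outside R, e.g. with gaussianlda in Python
write.csv(emb, "wiki_dfm_filtered.csv", row.names = FALSE)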

justinchuntingho commented 4 years ago

Thanks for pointing this out! My corpus is a sample of news articles from the US and a few European countries, and I think the lack of cross-lingual topics is the reason (there could be over 20 topics in total, but I doubt that many of them are cross-lingual across all languages).