
Crosslingual Topic Modeling with WikiPDA

Tiziano Piccardi, Robert West

We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using matrix completion and then training a standard monolingual topic model. A human evaluation shows that WikiPDA produces more coherent topics than monolingual text-based LDA, thus offering crosslinguality at no cost. We demonstrate WikiPDA’s utility in two applications: a study of topical biases in 28 Wikipedia editions, and crosslingual supervised classification. Finally, we highlight WikiPDA’s capacity for zero-shot language transfer, where a model is reused for new languages without any fine-tuning.

Full paper: https://arxiv.org/abs/2009.11207
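To make the two-step pipeline concrete, here is a minimal conceptual sketch using gensim. This is an illustrative toy, not the actual WikiPDA-Lib implementation: the QIDs are made up, and the matrix-completion densification step is only stubbed out.

```python
# Conceptual sketch of the WikiPDA pipeline (not the actual WikiPDA-Lib code).
# Articles are represented as bags of links mapped to Wikidata QIDs,
# which makes the representation language-independent.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus: each article is a bag of Wikidata QIDs extracted from its links.
articles = [
    ["Q42", "Q5", "Q36180"],    # e.g. an English article
    ["Q42", "Q36180", "Q571"],  # e.g. a French article on a related subject
    ["Q5", "Q7889", "Q11424"],
]

# Step 1 (stubbed here): WikiPDA densifies these bags via matrix completion,
# adding likely-but-missing links. We keep the raw bags in this sketch.
densified = articles

# Step 2: train a standard (monolingual-style) LDA model on the shared
# QID vocabulary, yielding language-independent topics.
dictionary = Dictionary(densified)
corpus = [dictionary.doc2bow(doc) for doc in densified]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# An article in any language can now be projected onto the same topics.
print(lda.get_document_topics(dictionary.doc2bow(["Q42", "Q571"])))
```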


Contents of this repository: WikiPDA-Lib and WikiPDA-HTTP-API.

Credits for WikiPDA-Lib and WikiPDA-HTTP-API: DanielBergThomsen

Precomputed topic distributions

You can find the precomputed topic distributions (K=300) for 28 languages here, based on the Wikipedia dump released on June 1st, 2022.
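The exact file layout of the precomputed release may differ from version to version; the sketch below assumes, purely for illustration, a pickled mapping from article ID to a sparse K=300 topic distribution.

```python
# Illustrative only: the actual file format of the precomputed release
# may differ. Assumed here: a pickled dict mapping article IDs to sparse
# topic distributions stored as [(topic_id, probability), ...].
import pickle

with open("enwiki_topic_distributions.pkl", "rb") as f:  # hypothetical filename
    topics_by_article = pickle.load(f)

K = 300

def to_dense(sparse_topics, k=K):
    """Expand a sparse [(topic_id, prob), ...] list into a dense K-dim vector."""
    vec = [0.0] * k
    for topic_id, prob in sparse_topics:
        vec[topic_id] = prob
    return vec
```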

Public Demo

We offer a public demo and APIs to test WikiPDA on a subset of languages (EN, DE, FR, IT, FA):

http://wikipda.dlab.tools/

Please note that the tool is a work in progress and not intended for large-scale data collection. For better performance, we recommend downloading the models and running the library locally.
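For small-scale experimentation, the API can be queried with any HTTP client. The endpoint path and parameters below are hypothetical placeholders; consult the WikiPDA-HTTP-API documentation for the real request shape.

```python
# Hypothetical request shape: the real endpoint and parameters are defined
# by WikiPDA-HTTP-API; adjust the path and fields accordingly.
import requests

resp = requests.get(
    "http://wikipda.dlab.tools/api/topics",  # hypothetical endpoint path
    params={"lang": "en", "title": "Machine learning"},  # hypothetical fields
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expected: a topic distribution for the article
```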

What can I do with WikiPDA?

WikiPDA gives you a language-agnostic distribution over K topics generated with LDA. The process is unsupervised, and it lets you discover which subjects are covered more heavily in different languages, compute the topical distance between Wikipedia editions, or obtain a topic vector that can be used as input for a supervised task.

A few examples:

Examine the density of topics in different languages: Topics Density
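A minimal sketch of this kind of analysis, using random stand-in topic vectors (not the demo's actual code or data):

```python
# Sketch: compare how much probability mass one topic receives across two
# editions. The data here is random Dirichlet noise standing in for real
# WikiPDA topic vectors; only the plotting recipe is the point.
import numpy as np
import matplotlib.pyplot as plt

topic_id = 42  # pick any of the K=300 topics

rng = np.random.default_rng(1)
en = rng.dirichlet(np.ones(300), size=500)[:, topic_id]  # stand-in for EN vectors
fr = rng.dirichlet(np.ones(300), size=500)[:, topic_id]  # stand-in for FR vectors

plt.hist(en, bins=30, alpha=0.5, density=True, label="en")
plt.hist(fr, bins=30, alpha=0.5, density=True, label="fr")
plt.xlabel(f"P(topic {topic_id})")
plt.ylabel("density")
plt.legend()
plt.show()
```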


Compute the topical distance between different Wikipedia editions:

Languages distances
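One reasonable way to compute such a distance (an illustrative choice, not necessarily the exact metric used in the paper) is the Jensen-Shannon distance between the mean topic distributions of two editions:

```python
# Illustrative distance between two Wikipedia editions: Jensen-Shannon
# distance between their mean topic distributions. The paper may use a
# different metric; this is just one reasonable choice.
import numpy as np
from scipy.spatial.distance import jensenshannon

def edition_distance(topics_a, topics_b):
    """topics_a, topics_b: (n_articles, K) arrays of topic distributions."""
    return jensenshannon(topics_a.mean(axis=0), topics_b.mean(axis=0))

# Toy example with random distributions over K=300 topics.
rng = np.random.default_rng(0)
a = rng.dirichlet(np.ones(300), size=100)
b = rng.dirichlet(np.ones(300), size=100)
print(edition_distance(a, b))
```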


Train a supervised model with custom labels (ORES in this example) that does not depend on the articles' language:

Supervised Learning
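A minimal sketch of this use case, with random stand-in features and labels (the real inputs would be WikiPDA topic vectors and, e.g., ORES quality classes):

```python
# Sketch: topic vectors as language-independent features for a classifier.
# Features and labels are random stand-ins, not real WikiPDA/ORES data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.dirichlet(np.ones(300), size=1000)  # stand-in K=300 topic vectors
y = rng.integers(0, 3, size=1000)           # stand-in labels (3 classes)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Because the features are language-independent, a model trained on one
# edition can be applied to articles from another (zero-shot transfer).
print("accuracy:", clf.score(X_test, y_test))
```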

Check the paper for more use cases: https://arxiv.org/abs/2009.11207