bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

tomotopy

.. image:: https://badge.fury.io/py/tomotopy.svg :target: https://pypi.python.org/pypi/tomotopy

.. image:: https://zenodo.org/badge/186155463.svg :target: https://zenodo.org/badge/latestdoi/186155463

🎌 English, 한국어_.

.. _한국어: README.kr.rst

What is tomotopy?

tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling based topic model library written in C++. It uses the vectorization (SIMD) capabilities of modern CPUs to maximize speed. The current version of tomoto supports several major topic models.

Please visit https://bab2min.github.io/tomotopy to see more information.

Getting Started

You can install tomotopy easily using pip. (https://pypi.org/project/tomotopy/) ::

$ pip install --upgrade pip
$ pip install tomotopy

The supported OS and Python versions are:

After installing, you can start tomotopy by just importing. ::

import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance. When the package is imported, it checks the available instruction sets and selects the best option. If tp.isa is 'none', training iterations may take a long time. However, since most modern Intel and AMD CPUs provide a SIMD instruction set, SIMD acceleration should bring a large improvement in most cases.
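The check described above can be wrapped in a small helper. This is only a sketch: the function name ``simd_advice`` and its messages are hypothetical, and it assumes ``tp.isa`` takes one of the four values listed in the comment of the snippet above. The import is guarded so that the helper can be read and run even where tomotopy is not installed.

```python
def simd_advice(isa: str) -> str:
    """Return a short note for a value of tp.isa (hypothetical helper)."""
    if isa == 'none':
        return "No SIMD instruction set detected; training iterations may be slow."
    if isa in ('avx2', 'avx', 'sse2'):
        return "SIMD acceleration enabled ({}).".format(isa.upper())
    # Values not listed in this README are reported rather than guessed at.
    return "Unrecognized instruction set value: {!r}".format(isa)


try:
    import tomotopy as tp
    print(simd_advice(tp.isa))
except ImportError:
    print("tomotopy is not installed")
```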

Here is sample code for simple LDA training on texts from the 'sample.txt' file. ::

import tomotopy as tp
mdl = tp.LDAModel(k=20)
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

mdl.summary()

Performance of tomotopy

tomotopy uses Collapsed Gibbs Sampling (CGS) to infer the distribution of topics and the distribution of words. Generally, CGS converges more slowly than the Variational Bayes (VB) method that gensim's LdaModel_ uses, but each iteration can be computed much faster. In addition, tomotopy can take advantage of multicore CPUs with a SIMD instruction set, which results in faster iterations.

.. _gensim's LdaModel: https://radimrehurek.com/gensim/models/ldamodel.html
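To make the idea of a CGS iteration concrete, here is a toy pure-Python collapsed Gibbs sampler for LDA. It is a textbook-style sketch and not tomotopy's C++ implementation: the function name ``toy_lda_cgs``, the hyperparameter defaults ``alpha`` and ``eta``, and the sample documents are all assumptions made for illustration. Each iteration visits every token, removes it from the count tables, and resamples its topic from the conditional distribution.

```python
import random


def toy_lda_cgs(docs, k, iterations=50, alpha=0.1, eta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustration only)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    n_dk = [[0] * k for _ in docs]      # topic counts per document
    n_kw = [[0] * v for _ in range(k)]  # word counts per topic
    n_k = [0] * k                       # total tokens per topic
    z = []                              # topic assignment of every token

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(k)
            z[d].append(t)
            n_dk[d][t] += 1
            n_kw[t][word_id[w]] += 1
            n_k[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                old = z[d][i]
                # Remove the token from the counts ...
                n_dk[d][old] -= 1
                n_kw[old][wid] -= 1
                n_k[old] -= 1
                # ... compute the unnormalised conditional topic weights ...
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][wid] + eta) / (n_k[t] + v * eta)
                    for t in range(k)
                ]
                # ... and resample the topic of this token.
                new = rng.choices(range(k), weights=weights)[0]
                z[d][i] = new
                n_dk[d][new] += 1
                n_kw[new][wid] += 1
                n_k[new] += 1
    return z, n_dk, n_kw, vocab


docs = [
    "apple banana apple fruit".split(),
    "goal match team goal".split(),
    "fruit banana juice apple".split(),
    "team player match score".split(),
]
z, n_dk, n_kw, vocab = toy_lda_cgs(docs, k=2)
```

Because every token is resampled one at a time from simple count tables, a single iteration is cheap, which is consistent with the trade-off described above: many fast iterations rather than a few expensive ones.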

The following chart compares the running time of the LDA model between tomotopy and gensim. The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB). tomotopy trains for 200 iterations and gensim trains for 10 iterations.

.. image:: https://bab2min.github.io/tomotopy/images/tmt_i5.png

Performance in Intel i5-6600, x86-64 (4 cores)

.. image:: https://bab2min.github.io/tomotopy/images/tmt_xeon.png

Performance in Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

Although tomotopy ran 20 times more iterations, its overall running time was 5 to 10 times shorter than gensim's, and it yields a stable result.

It is difficult to compare CGS and VB directly because they are totally different techniques. From a practical point of view, however, we can compare their speed and their results. The following chart shows the log-likelihood per word of the two models' results.

.. image:: https://bab2min.github.io/tomotopy/images/LLComp.png
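As a reminder of what the quantity on this chart measures, the following sketch computes a log-likelihood per word from a total log-likelihood and converts it to perplexity. It is generic arithmetic, not code taken from tomotopy or gensim; the function names and the sample numbers are hypothetical.

```python
import math


def log_likelihood_per_word(total_log_likelihood: float, num_words: int) -> float:
    """Average the total log-likelihood of a corpus over its word count."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return total_log_likelihood / num_words


def perplexity(ll_per_word: float) -> float:
    """Perplexity is the exponential of the negative per-word log-likelihood."""
    return math.exp(-ll_per_word)


# Hypothetical numbers: a total log-likelihood of -8000 over 1000 words.
llpw = log_likelihood_per_word(-8000.0, 1000)
ppl = perplexity(llpw)
```

A higher (less negative) log-likelihood per word corresponds to a lower perplexity, so the two measures rank models in the same order.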

The SIMD instruction set has a great effect on performance. The following chart compares the SIMD instruction sets.

.. image:: https://bab2min.github.io/tomotopy/images/SIMDComp.png

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Model Save and Load

tomotopy provides save and load methods for each topic model class, so you can save a model to a file whenever you want and reload it from that file. ::

import tomotopy as tp

mdl = tp.HDPModel()
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is HDP model, 
# so when you load it by LDA model, it will raise an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose load method you call.

See more at tomotopy.LDAModel.save and tomotopy.LDAModel.load methods.

Documents in the Model and out of the Model

We can use a topic model for two major purposes. The basic one is to discover topics from a set of documents as the result of a trained model, and the more advanced one is to infer topic distributions for unseen documents using a trained model.

We call a document used for the former purpose (model training) a document in the model, and a document used for the latter purpose (unseen during training) a document out of the model.

In tomotopy, these two kinds of documents are created differently. A document in the model is created by the tomotopy.LDAModel.add_doc method. add_doc can be called only before tomotopy.LDAModel.train starts. In other words, once train has been called, add_doc cannot add a document to the model, because the set of documents used for training has become fixed.

To acquire the instance of the created document, you should use tomotopy.LDAModel.docs like:

::

mdl = tp.LDAModel(k=20)
idx = mdl.add_doc(words)
if idx < 0: raise RuntimeError("Failed to add doc")
doc_inst = mdl.docs[idx]
# doc_inst is an instance of the added document

A document out of the model is created by the tomotopy.LDAModel.make_doc method. make_doc should be called only after train starts. If you use make_doc before the set of documents used for training has become fixed, you may get wrong results. Since make_doc returns the instance directly, you can use its return value for other manipulations.

::

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_doc) # doc_inst is an instance of the unseen document

Inference for Unseen Documents

If a new document is created by tomotopy.LDAModel.make_doc, its topic distribution can be inferred by the model. Inference for an unseen document should be performed using the tomotopy.LDAModel.infer method.

::

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_doc)
topic_dist, ll = mdl.infer(doc_inst)
print("Topic Distribution for Unseen Docs: ", topic_dist)
print("Log-likelihood of inference: ", ll)

The infer method accepts either a single instance of tomotopy.Document or a list of instances of tomotopy.Document. See more at tomotopy.LDAModel.infer.

Corpus and transform

Every topic model in tomotopy has its own internal document type. A document in the form suitable for each model is created and added through that model's add_doc method. However, adding the same list of documents to several models is inconvenient, because add_doc must be called for every document in each model. Thus, tomotopy provides the tomotopy.utils.Corpus class, which holds a list of documents. A tomotopy.utils.Corpus can be inserted into any model by passing it as the corpus argument to __init__ or to the add_corpus method of the model. Inserting a tomotopy.utils.Corpus has the same effect as inserting the documents the corpus holds.

Some topic models require additional data for their documents. For example, tomotopy.DMRModel requires the argument metadata of type str, but tomotopy.PLDAModel requires the argument labels of type List[str]. Since tomotopy.utils.Corpus holds an independent set of documents rather than being tied to a specific topic model, the data a corpus carries may not match what the topic model requires when the corpus is added to it. In this case, the miscellaneous data can be converted to fit the target topic model using the transform argument. See the following code for details:

::

from tomotopy import DMRModel
from tomotopy.utils import Corpus

corpus = Corpus()
corpus.add_doc("a b c d e".split(), a_data=1)
corpus.add_doc("e f g h i".split(), a_data=2)
corpus.add_doc("i j k l m".split(), a_data=3)

model = DMRModel(k=10)
model.add_corpus(corpus) 
# You lose `a_data` field in `corpus`, 
# and `metadata` that `DMRModel` requires is filled with the default value, empty str.

assert model.docs[0].metadata == ''
assert model.docs[1].metadata == ''
assert model.docs[2].metadata == ''

def transform_a_data_to_metadata(misc: dict):
    return {'metadata': str(misc['a_data'])}
# this function transforms `a_data` to `metadata`

model = DMRModel(k=10)
model.add_corpus(corpus, transform=transform_a_data_to_metadata)
# Now the docs in `model` have non-default `metadata` generated from the `a_data` field.

assert model.docs[0].metadata == '1'
assert model.docs[1].metadata == '2'
assert model.docs[2].metadata == '3'
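The same pattern should carry over to models that expect other fields. As a sketch, the function below converts the ``a_data`` field into the ``labels`` argument of type List[str] that tomotopy.PLDAModel is described as requiring. The function name ``a_data_to_labels`` is hypothetical, and the choice to return an empty list when ``a_data`` is missing is an assumption, not documented tomotopy behavior. Since a transform is a plain function from dict to dict, it can be checked on its own before being handed to a model.

```python
def a_data_to_labels(misc: dict) -> dict:
    """Map the miscellaneous field `a_data` to a `labels` list of strings."""
    value = misc.get('a_data')
    # Assumption: a document without `a_data` simply gets no labels.
    return {'labels': [] if value is None else [str(value)]}


# The transform can be exercised without any model.
converted = a_data_to_labels({'a_data': 1})
unlabeled = a_data_to_labels({})

# Intended use, mirroring the DMRModel example above (not executed here):
# model = PLDAModel(...)
# model.add_corpus(corpus, transform=a_data_to_labels)
```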

Parallel Sampling Algorithms

Since version 0.5.0, tomotopy allows you to choose a parallelism algorithm. The algorithm provided in versions prior to 0.4.2 is COPY_MERGE, which is provided for all topic models. The new algorithm PARTITION, available since 0.5.0, generally makes training faster and more memory-efficient, but it is not available for all topic models.

The following chart shows the speed difference between the two algorithms based on the number of topics and the number of workers.

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png

Performance by Version

Performance changes by version are shown in the following graphs. The time taken to train the LDA model for 1000 iterations was measured. (Docs: 11314, Vocab: 60382, Words: 2364724, Intel Xeon Gold 5120 @2.2GHz)

.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t1.png

.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t4.png

.. image:: https://bab2min.github.io/tomotopy/images/lda-perf-t8.png

Pinning Topics using Word Priors

Since version 0.6.0, the method tomotopy.LDAModel.set_word_prior is available. It allows you to control the word prior for each topic. For example, the following code sets the weight of the word 'church' to 1.0 in topic 0 and to 0.1 in the rest of the topics. This means that the probability that the word 'church' is assigned to topic 0 is 10 times higher than the probability of it being assigned to any other topic. Therefore, most occurrences of 'church' are assigned to topic 0, so topic 0 comes to contain many words related to 'church'. This lets you steer a particular topic to a specific topic number.

::

import tomotopy as tp
mdl = tp.LDAModel(k=20)

# add documents into `mdl`

# setting word prior
mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])

See word_prior_example in example.py for more details.
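The prior list passed to set_word_prior can be built by a small helper, which also makes the 10-to-1 ratio explicit. This is a sketch: the name ``pinned_word_prior`` and its parameters are hypothetical, and only the list it produces corresponds to the argument shown in the snippet above.

```python
def pinned_word_prior(num_topics, pinned_topic, high=1.0, low=0.1):
    """Build a per-topic prior that favors one topic for a given word."""
    if not 0 <= pinned_topic < num_topics:
        raise ValueError("pinned_topic must be in range(num_topics)")
    return [high if k == pinned_topic else low for k in range(num_topics)]


prior = pinned_word_prior(20, 0)
# Prior weight of the pinned topic relative to any other topic.
ratio = prior[0] / prior[1]

# Intended use, as in the snippet above (not executed here):
# mdl.set_word_prior('church', prior)
```

Raising ``high`` or lowering ``low`` strengthens the pinning, while equal values leave the word unbiased across topics.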

Examples

You can find example Python code for tomotopy at https://github.com/bab2min/tomotopy/blob/main/examples/ .

You can also get the data file used in the example code at https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .

License

tomotopy is licensed under the terms of the MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

History

.. _EigenRand: https://github.com/bab2min/EigenRand

Bindings for Other Languages

Bundled Libraries and Their License

Citation

::

@software{minchul_lee_2022_6868418,
  author       = {Minchul Lee},
  title        = {bab2min/tomotopy: 0.12.3},
  month        = jul,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {v0.12.3},
  doi          = {10.5281/zenodo.6868418},
  url          = {https://doi.org/10.5281/zenodo.6868418}
}