bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

[QUESTION] What's the difference between add_corpus and add_doc? #129

Closed. batmanscode closed this issue 2 years ago

batmanscode commented 3 years ago

Hi, this is a wonderful library! Very grateful that you've put in the time to create and maintain this. Thank you 😃

Forgive my ignorance, but besides the transform parameter, what is the difference between add_corpus and add_doc?

bab2min commented 3 years ago

Hi @batmanscode, thank you for your interest in my package!

First, every topic model in tomotopy has its own internal document type. You create and add a document suited to a given model through that model's add_doc method. That works well for a single model, but adding the same list of documents to several different models quickly becomes inconvenient, because you have to call add_doc for every document on every model. For this reason, tomotopy provides the Corpus class, which holds a list of documents. You can insert a Corpus into a model either by passing it as the corpus argument of the model's __init__ or by calling the model's add_corpus method. Inserting a Corpus therefore has the same effect as inserting each of the documents it holds.

Last, the transform argument is a bit tricky. Different topic models require different data for their documents. For example, DMRModel requires a metadata argument of type str, whereas PLDAModel requires a labels argument of type List[str]. Because a Corpus holds an independent set of documents rather than being tied to a specific topic model, the data stored in it may not match what a particular model expects when you add the corpus to that model. In that case, you can use the transform argument to convert the miscellaneous data into the form your topic model needs. The following example may help:

from tomotopy import DMRModel
from tomotopy.utils import Corpus

corpus = Corpus()
corpus.add_doc("a b c d e".split(), a_data=1)
corpus.add_doc("e f g h i".split(), a_data=2)
corpus.add_doc("i j k l m".split(), a_data=3)

model = DMRModel(k=10)
model.add_corpus(corpus)
# The `a_data` field of `corpus` is dropped,
# and the `metadata` that `DMRModel` requires is filled with its default value, an empty str.

assert model.docs[0].metadata == ''
assert model.docs[1].metadata == ''
assert model.docs[2].metadata == ''

# This function transforms `a_data` into `metadata`.
def transform_a_data_to_metadata(misc: dict):
    return {'metadata': str(misc['a_data'])}

model = DMRModel(k=10)
model.add_corpus(corpus, transform=transform_a_data_to_metadata)
# Now the docs in `model` have non-default `metadata`, generated from the `a_data` field.

assert model.docs[0].metadata == '1'
assert model.docs[1].metadata == '2'
assert model.docs[2].metadata == '3'
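
If it helps as a mental model, the behaviour in the example above can be imitated in plain Python. This is only a toy sketch and not tomotopy's actual implementation: ToyCorpus and ToyDMR are hypothetical stand-ins, and the assumption is that add_corpus keeps only the fields the model understands, after applying transform to each document's miscellaneous data.

```python
from typing import Callable, Optional


class ToyCorpus:
    """Hypothetical stand-in for a model-independent document container."""

    def __init__(self):
        self.docs = []  # list of (words, misc) pairs

    def add_doc(self, words, **misc):
        # Any extra keyword arguments are stored as miscellaneous data.
        self.docs.append((list(words), misc))


class ToyDMR:
    """Hypothetical stand-in for a model whose documents carry a `metadata` str."""

    def __init__(self):
        self.docs = []  # list of (words, metadata) pairs

    def add_doc(self, words, metadata=''):
        self.docs.append((list(words), metadata))

    def add_corpus(self, corpus, transform: Optional[Callable[[dict], dict]] = None):
        for words, misc in corpus.docs:
            # Convert the miscellaneous data if a transform was given.
            fields = transform(misc) if transform else misc
            # Keep only the field this toy model understands; the rest is dropped.
            self.add_doc(words, metadata=fields.get('metadata', ''))


corpus = ToyCorpus()
corpus.add_doc("a b c".split(), a_data=1)
corpus.add_doc("c d e".split(), a_data=2)

plain = ToyDMR()
plain.add_corpus(corpus)  # `a_data` is ignored, `metadata` stays ''

converted = ToyDMR()
converted.add_corpus(corpus, transform=lambda misc: {'metadata': str(misc['a_data'])})
```

Here plain ends up with empty metadata on both documents, while converted carries '1' and '2', which mirrors the two halves of the DMRModel example.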

batmanscode commented 3 years ago

Thank you @bab2min that was a very thorough explanation! Very much appreciated 😊. I think corpus might come in handy as I play around with more models.

Please consider adding your explanation and example to the docs. I think that would be helpful. Maybe you can put it right before "Parallel Sampling Algorithms".

bab2min commented 2 years ago

@batmanscode Thank you for the good suggestion. I'll add it to the documentation in the next update!