Closed batmanscode closed 2 years ago
Hi @batmanscode, thank you for your interest in my package!
First, every topic model in `tomotopy` has its own internal document type. You can create and add a document suited to each model through that model's `add_doc` method. So far, so good.
However, adding the same list of documents to several models becomes quite inconvenient, because you have to call `add_doc` for every document on each model separately.
Thus, `tomotopy` provides a `Corpus` class that holds a list of documents. You can insert a `Corpus` into a model by passing it as the `corpus` argument to `__init__` or to the `add_corpus` method of each model. Inserting a `Corpus` has the same effect as inserting the documents that the `Corpus` holds.
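To make the contrast concrete, here is a minimal sketch of both approaches. It assumes two models, `LDAModel` and `HDPModel`, chosen only for illustration, and a few toy documents. The helper names `add_docs_one_by_one` and `build_models_from_corpus` are hypothetical and not part of `tomotopy`; `tomotopy` is imported inside the second helper so the first one can be read on its own.

```python
# Toy tokenized documents, used only for illustration.
TOKENIZED_DOCS = [
    "a b c d e".split(),
    "e f g h i".split(),
    "i j k l m".split(),
]


def add_docs_one_by_one(model, tokenized_docs):
    """Without a Corpus: call add_doc once per document, for each model."""
    for words in tokenized_docs:
        model.add_doc(words)
    return len(tokenized_docs)


def build_models_from_corpus(tokenized_docs):
    """With a Corpus: build the document list once, then hand it to each model."""
    import tomotopy as tp
    from tomotopy.utils import Corpus

    corpus = Corpus()
    for words in tokenized_docs:
        corpus.add_doc(words)

    # Option 1: pass the corpus to __init__.
    lda = tp.LDAModel(k=2, corpus=corpus)

    # Option 2: create the model first, then call add_corpus.
    hdp = tp.HDPModel()
    hdp.add_corpus(corpus)
    return lda, hdp
```

Calling `build_models_from_corpus(TOKENIZED_DOCS)` should give two models that each hold the three documents, the same result as calling `add_docs_one_by_one` on each model in turn.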
Last, the `transform` argument is a bit tricky. Some topic models require different data for their documents. For example, `DMRModel` requires a `metadata` argument of type `str`, but `PLDAModel` requires a `labels` argument of type `List[str]`. Since a `Corpus` holds an independent set of documents rather than being tied to a specific topic model, the data it carries may not match what a topic model requires when you add the corpus to that model. In this case, you can use the `transform` argument to convert the miscellaneous data into the form your topic model expects. The following example may help:
```python
from tomotopy import DMRModel
from tomotopy.utils import Corpus

corpus = Corpus()
corpus.add_doc("a b c d e".split(), a_data=1)
corpus.add_doc("e f g h i".split(), a_data=2)
corpus.add_doc("i j k l m".split(), a_data=3)

model = DMRModel(k=10)
model.add_corpus(corpus)
# The `a_data` field in `corpus` is lost here,
# and the `metadata` that `DMRModel` requires is filled with its default value, an empty str.
assert model.docs[0].metadata == ''
assert model.docs[1].metadata == ''
assert model.docs[2].metadata == ''

# This function transforms `a_data` into `metadata`.
def transform_a_data_to_metadata(misc: dict):
    return {'metadata': str(misc['a_data'])}

model = DMRModel(k=10)
model.add_corpus(corpus, transform=transform_a_data_to_metadata)
# Now the docs in `model` have non-default `metadata`, generated from the `a_data` field.
assert model.docs[0].metadata == '1'
assert model.docs[1].metadata == '2'
assert model.docs[2].metadata == '3'
```
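The same corpus can feed a model that expects different data. As a hedged sketch, the snippet below maps the `a_data` field to the `labels` list that `PLDAModel` requires. The function names `a_data_to_labels` and `build_plda_model` and the `label_<n>` naming scheme are assumptions made up for this illustration, not part of `tomotopy`.

```python
from typing import Dict, List


def a_data_to_labels(misc: dict) -> Dict[str, List[str]]:
    """Turn the `a_data` field into the `labels` list that PLDAModel expects.

    The `label_<value>` format is an arbitrary choice for this example.
    """
    return {'labels': [f"label_{misc['a_data']}"]}


def build_plda_model():
    """Build a PLDAModel from a corpus whose documents only carry `a_data`."""
    # Imported here so the transform above can be tried without tomotopy.
    from tomotopy import PLDAModel
    from tomotopy.utils import Corpus

    corpus = Corpus()
    corpus.add_doc("a b c d e".split(), a_data=1)
    corpus.add_doc("e f g h i".split(), a_data=2)
    corpus.add_doc("i j k l m".split(), a_data=3)

    model = PLDAModel(latent_topics=1, topics_per_label=1)
    # Each document's misc dict is passed through the transform,
    # so the model receives `labels` instead of `a_data`.
    model.add_corpus(corpus, transform=a_data_to_labels)
    return model
```

Because the transform is supplied when the corpus is added, one `Corpus` can be reused for `DMRModel` and `PLDAModel` with a different transform function for each.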
Thank you @bab2min that was a very thorough explanation! Very much appreciated 😊. I think corpus might come in handy as I play around with more models.
Please consider adding your explanation and example to the docs. I think that would be helpful. Maybe you can put it right before "Parallel Sampling Algorithms".
@batmanscode Thank you for the good suggestion. I'll add it to the documentation in the next update!
Hi, this is a wonderful library! Very grateful that you've put in the time to create and maintain this. Thank you 😃
Forgive my ignorance, but besides the `transform` parameter, what is the difference between `add_corpus` and `add_doc`?