bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Problems with empty uid #144

Open bab2min opened 3 years ago

bab2min commented 3 years ago

@Jurgita-DS Aha, you created it without the uid parameter. I'll check it. Thank you!

I am having an issue when training an LDA model: I get 'uid' values of '' for all documents. I also don't see any option to provide document ids to the Corpus as you mention here. Is there a way to include user-defined document ids?

Originally posted by @MarkWClements in https://github.com/bab2min/tomotopy/issues/62#issuecomment-909785385

bab2min commented 3 years ago

@MarkWClements You can provide uid as an optional argument to Corpus.add_doc, as follows:

import tomotopy as tp

corpus = tp.utils.Corpus()
# some_words is the list of tokens for one document
corpus.add_doc(some_words, uid="doc1")
corpus.add_doc(some_words, uid="doc2")
corpus.add_doc(some_words, uid="doc3")

I'll supplement the documentation about this.
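
For a whole list of documents, the same call can be wrapped in a loop. This is only a minimal sketch: it assumes `corpus` is any object whose `add_doc` accepts a `uid` keyword as shown above, and the helper name `add_docs_with_uids` and the `doc1`, `doc2`, ... naming scheme are illustrative, not part of tomotopy.

```python
def add_docs_with_uids(corpus, docs, prefix="doc"):
    """Add each token list in `docs` to `corpus` with a generated uid.

    Hypothetical helper (not a tomotopy function). Returns the uids in
    insertion order so they can be matched to the original documents later.
    """
    uids = []
    for n, words in enumerate(docs, start=1):
        uid = f"{prefix}{n}"            # e.g. "doc1", "doc2", ...
        corpus.add_doc(words, uid=uid)  # same call as in the example above
        uids.append(uid)
    return uids
```

With tomotopy installed, this would be called as `add_docs_with_uids(tp.utils.Corpus(), docs)`, where `docs` is a list of token lists.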

MarkWClements-zz commented 2 years ago

Is there a way to add a uid to the existing documents after the model is already trained, or do I have to re-train the model with this feature? Also, do the documents persist in the same order in which they are fed in here:

corpus = tp.utils.Corpus()
for doc in docs:
    corpus.add_doc(words=doc)

That is, when I call

trained_docs = lda.docs

is trained_docs[n] the same document as docs[n]? I can manually add labels later if this is the case, I just want to make sure the document order is preserved in training the model.

Thanks

bab2min commented 2 years ago

Hi @MarkWClements

  1. Currently, there is no feature for modifying uid. I'll add it to the list of future development features.

  2. Usually, trained_docs[n] is the same document as docs[n], except in a few cases where the corpus has unsupported documents (e.g. documents with no words). You can check this by comparing their lengths: len(trained_docs) == len(docs). If len(trained_docs) is different from len(docs), it means there were errors when pushing the documents of docs into the lda model and some of them are missing.

In the current version, errors or warnings related to inserting a corpus into a model are not clearly displayed, but I will improve this in a later patch.
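
The length check described above can be sketched as follows. It assumes that documents with no words are the ones being dropped, which is the only example given here, so other unsupported cases would go undetected. The function name `check_alignment` is hypothetical; it accepts any two sequences, so `lda.docs` can be passed as `trained_docs`.

```python
def check_alignment(docs, trained_docs):
    """Compare the input documents with the documents held by the model.

    Returns (aligned, empty_indices):
      aligned        True when both sequences have the same length.
      empty_indices  Positions in `docs` that contain no words; these are
                     the likeliest candidates to have been dropped
                     (an assumption, not confirmed library behaviour).
    """
    aligned = len(trained_docs) == len(docs)
    empty_indices = [i for i, words in enumerate(docs) if len(words) == 0]
    return aligned, empty_indices
```

If `aligned` is False, removing the entries at `empty_indices` from `docs` before building the corpus is one way to keep `trained_docs[n]` and `docs[n]` in step.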