Open bab2min opened 3 years ago
@MarkWClements
You can provide uid as an optional argument to Corpus.add_doc, as follows:
import tomotopy as tp
corpus = tp.utils.Corpus()
corpus.add_doc(some_words, uid="doc1")
corpus.add_doc(some_words, uid="doc2")
corpus.add_doc(some_words, uid="doc3")
I'll supplement the documentation about this.
Is there a way to add a uid to the existing documents after the model is already trained, or do I have to re-train the model with this feature? Also, do the documents persist in the same order in which they are fed in, like this:
corpus = tp.utils.Corpus()
for doc in docs:
    corpus.add_doc(words=doc)
That is, when I call trained_docs = lda.docs, is trained_docs[n] the same document as docs[n]? I can manually add labels later if this is the case; I just want to make sure the document order is preserved when training the model.
Thanks
Hi @MarkWClements
Currently, there is no feature for modifying uid. I'll add it to the list of future development features.
Usually, trained_docs[n] is the same document as docs[n], except in a few cases where corpus has unsupported documents (e.g. documents with no words). You can check this by comparing their lengths: len(trained_docs) == len(docs).
If len(trained_docs) is different from len(docs), it means there were errors in pushing the documents of docs into the lda model and some of them are missing.
In the current version, errors or warnings related to inserting a corpus into models are not clearly displayed, but I will improve this in a later patch.
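One way to keep positions aligned is to filter out empty documents yourself before building the corpus, and to remember each surviving document's original index. The following is only a minimal sketch in plain Python: prefilter_docs is a hypothetical helper (not part of tomotopy), and it assumes that documents with no words are the only ones that get dropped, which may not cover every unsupported case.

```python
def prefilter_docs(docs):
    """Drop empty documents and record each survivor's original index.

    Hypothetical helper, not a tomotopy function. Assumes that empty
    documents are the only ones the model would skip.
    """
    kept_words = []
    original_index = []
    for i, words in enumerate(docs):
        if len(words) > 0:
            kept_words.append(words)
            original_index.append(i)
    return kept_words, original_index


# Toy input: the second document has no words.
docs = [["topic", "model"], [], ["latent", "dirichlet"]]
kept_words, original_index = prefilter_docs(docs)

# kept_words[k] came from docs[original_index[k]]
for k, words in enumerate(kept_words):
    print(k, original_index[k], words)
```

Each kept_words[k] could then be passed to corpus.add_doc, optionally with uid=str(original_index[k]) as in the earlier comment, so a trained document can be traced back to docs even if the len(trained_docs) == len(docs) check would otherwise fail.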
I am having an issue when training an LDA model: I get 'uid' values of '' for all documents. I also don't see any option to provide document ids to the Corpus as you mention here. Is there the capability of including user-defined document ids?
Originally posted by @MarkWClements in https://github.com/bab2min/tomotopy/issues/62#issuecomment-909785385