bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Convert a document-by-term frequency matrix to corpus class #118

Open DongqingSun96 opened 3 years ago

DongqingSun96 commented 3 years ago

Hi,

As far as I know, gensim provides a function, Sparse2Corpus, that converts a sparse matrix to the gensim corpus format. Is there a similar function in tomotopy that can convert a document-by-term matrix to tomotopy's corpus class?

Thanks.

bab2min commented 3 years ago

@DongqingSun96, oops, sorry for the late answer. I totally forgot about this issue. Currently, tomotopy doesn't provide such a function, because its internal implementation cannot accept a matrix in bag-of-words format. To insert a sparse matrix into tomotopy's corpus in the current version, you should restore the word list from the matrix and call add_doc() repeatedly.
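The workaround described above can be sketched roughly as follows. This is only an illustration, not part of tomotopy: the helper names `rows_to_word_lists` and `build_tomotopy_corpus`, the toy vocabulary, and the choice to skip all-zero rows are assumptions made for the example. It also assumes that the `add_doc()` meant here is the one on `tomotopy.utils.Corpus`, called with a `words` list. Word order within each restored document is arbitrary, since a bag-of-words matrix does not preserve it.

```python
from typing import Iterable, Iterator, List, Sequence


def rows_to_word_lists(
    rows: Iterable[Sequence[int]], vocab: Sequence[str]
) -> Iterator[List[str]]:
    """Expand each row of term counts into a flat list of repeated words."""
    for row in rows:
        if len(row) != len(vocab):
            raise ValueError("row length must match vocabulary size")
        words: List[str] = []
        for term_id, count in enumerate(row):
            # Repeat each word as many times as it occurs in the document.
            words.extend([vocab[term_id]] * int(count))
        yield words


def build_tomotopy_corpus(rows, vocab):
    """Feed the restored word lists to tomotopy, one add_doc() call per row."""
    import tomotopy as tp  # imported lazily; only this helper needs tomotopy

    corpus = tp.utils.Corpus()
    for words in rows_to_word_lists(rows, vocab):
        # Assumption: all-zero rows are skipped, which shifts the document
        # indices relative to the matrix rows.
        if words:
            corpus.add_doc(words=words)
    return corpus


# Toy document-by-term matrix: 2 documents, 3 terms (hypothetical data).
vocab = ["apple", "banana", "cherry"]
dtm = [[2, 0, 1], [0, 3, 0]]
docs = list(rows_to_word_lists(dtm, vocab))
```

A dense NumPy array can be passed as `rows` directly. A SciPy sparse matrix would first need densifying (for example with `.toarray()`) or a row-wise loop over its nonzero entries, which is cheaper for large vocabularies.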

If utility functions like Dense2Corpus or Sparse2Corpus are needed often, I can improve tomotopy.corpus to accept matrix input by modifying its internal implementation, but that will take some time.

bab2min commented 3 years ago

Adding new features to tomotopy.utils.Corpus for constructing a corpus from a matrix:

New features for constructing a matrix from a Corpus:

Exposing the vocab dict property of Corpus