ericproffitt / TopicModelsVB.jl

A Julia package for variational Bayesian topic modeling.

Does it have the functionality to reduce vocabulary size? #26

Closed by ValeriiBaidin 4 years ago

ValeriiBaidin commented 4 years ago

Does it have the functionality to reduce vocabulary size? For instance, keep only the top 10000 words, or remove words that occur in fewer than K documents.

Thank you in advance.

ericproffitt commented 4 years ago

There isn't currently functionality for keeping the top N words. However, you can remove all words that appear fewer than N times in the corpus by doing,

abridge_corp!(corp, N)
fixcorp!(corp, trim=true)

or more succinctly,

fixcorp!(corp, abridge=N, trim=true)
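
For the two filters asked about in the question (top N words, and a minimum number of documents), here is a minimal sketch in plain Julia. It does not use the TopicModelsVB API: the toy `docs` data and the helper names `term_counts`, `doc_frequencies`, `keep_top_n` and `drop_rare` are hypothetical, and documents are assumed to be plain vectors of word strings.

```julia
# Hypothetical toy corpus: each document is a vector of word tokens.
docs = [
    ["topic", "model", "julia", "topic"],
    ["model", "bayes", "topic"],
    ["julia", "model", "rare"],
]

# Corpus-wide count of each word.
function term_counts(docs)
    counts = Dict{String,Int}()
    for doc in docs, word in doc
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

# Number of documents in which each word appears at least once.
function doc_frequencies(docs)
    df = Dict{String,Int}()
    for doc in docs, word in Set(doc)
        df[word] = get(df, word, 0) + 1
    end
    return df
end

# Keep only the n most frequent words (ties broken alphabetically).
function keep_top_n(docs, n)
    counts = term_counts(docs)
    ranked = sort(collect(keys(counts)); by = w -> (-counts[w], w))
    keep = Set(ranked[1:min(n, length(ranked))])
    return [filter(w -> w in keep, doc) for doc in docs]
end

# Drop words that appear in fewer than k documents.
function drop_rare(docs, k)
    df = doc_frequencies(docs)
    return [filter(w -> df[w] >= k, doc) for doc in docs]
end

top2 = keep_top_n(docs, 2)    # keeps only "model" and "topic"
common = drop_rare(docs, 2)   # drops "bayes" and "rare"
```

In principle, the corpus count of the n-th ranked word could also serve as the threshold passed to `abridge`, though any words tied at that count would also be kept, so the vocabulary could end up larger than n.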