ContinuumIO / topik

A Topic Modeling toolbox
BSD 3-Clause "New" or "Revised" License

fix Term Frequency input to pyLDAvis #34

Closed youngblood closed 9 years ago

youngblood commented 9 years ago

Currently, I think we are passing 'the number of documents in which a term appears' as the term frequency. I think we should instead pass 'the total number of occurrences of a term in the entire corpus'. Ideally, this value would be calculated once and then stored for each term in the intermediate data store.
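To make the distinction concrete, here is a minimal sketch using only the standard library. The toy corpus and all variable names are made up for illustration and are not taken from topik's code.

```python
from collections import Counter

# Toy tokenized corpus (illustrative only, not from topik).
corpus = [
    ["topic", "model", "topic"],
    ["model", "data"],
    ["topic", "data", "data", "data"],
]

# Document frequency: number of documents in which each term appears.
document_frequency = Counter()
for doc in corpus:
    document_frequency.update(set(doc))  # count each term once per document

# Term frequency: total occurrences of each term across the entire corpus.
term_frequency = Counter()
for doc in corpus:
    term_frequency.update(doc)  # count every occurrence

# Align the counts with a fixed vocabulary order.
vocab = sorted(term_frequency)
term_frequency_list = [term_frequency[t] for t in vocab]
document_frequency_list = [document_frequency[t] for t in vocab]
doc_lengths = [len(doc) for doc in corpus]

print(vocab)
print(term_frequency_list)
print(document_frequency_list)
```

With total occurrences, the term frequencies sum to the same value as the document lengths, which is generally not true of document counts.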

msarahan commented 9 years ago

I think this issue is related (reported by @AHMcKenzie):

With the proper Topik version, the sample demo worked fine. However, I get the error below when trying to obtain an LDAvis plot. I noticed that this was flagged a couple of days ago, so I'll wait to see the outcome of that fix. Thanks and regards

ValidationError                           Traceback (most recent call last)
in ()
----> 1 plot_lda_vis(model.to_py_lda_vis())

/Users/alexmckenzie/anaconda/lib/python2.7/site-packages/topik/viz.pyc in plot_lda_vis(model_data)
     65     """Designed to work with to_py_lda_vis() in the model classes."""
     66     from pyLDAvis import prepare, show
---> 67     model_vis_data = prepare(**model_data)
     68     show(model_vis_data)

/Users/alexmckenzie/anaconda/lib/python2.7/site-packages/pyLDAvis/_prepare.pyc in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts)
    277     doc_lengths = _series_with_name(doc_lengths, 'doc_length')
    278     vocab = _series_with_name(vocab, 'vocab')
--> 279     _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
    280     R = min(R, len(vocab))
    281

/Users/alexmckenzie/anaconda/lib/python2.7/site-packages/pyLDAvis/_prepare.pyc in _input_validate(*args)
     57     res = _input_check(*args)
     58     if res:
---> 59         raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
     60
     61

ValidationError:
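
The error message above is cut off, so it does not show which check failed. As a rough sketch only, the five inputs named in the traceback could be sanity-checked before calling `prepare`. The helper `check_vis_inputs` and the conditions it tests are assumptions about what consistent inputs look like, not pyLDAvis's actual validation code.

```python
def check_vis_inputs(topic_term_dists, doc_topic_dists, doc_lengths,
                     vocab, term_frequency, tol=1e-6):
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems = []
    n_topics = len(topic_term_dists)
    n_terms = len(vocab)
    n_docs = len(doc_lengths)

    # One frequency value per vocabulary term.
    if len(term_frequency) != n_terms:
        problems.append("term_frequency and vocab differ in length")
    # Each topic is a distribution over the vocabulary.
    if any(len(row) != n_terms for row in topic_term_dists):
        problems.append("topic_term_dists rows do not match vocab size")
    # One topic distribution per document.
    if len(doc_topic_dists) != n_docs:
        problems.append("doc_topic_dists and doc_lengths differ in length")
    if any(len(row) != n_topics for row in doc_topic_dists):
        problems.append("doc_topic_dists rows do not match number of topics")
    # Distributions are assumed to sum to 1 (within a tolerance).
    if any(abs(sum(row) - 1.0) > tol for row in topic_term_dists):
        problems.append("some rows of topic_term_dists do not sum to 1")
    if any(abs(sum(row) - 1.0) > tol for row in doc_topic_dists):
        problems.append("some rows of doc_topic_dists do not sum to 1")
    return problems


# Toy inputs (illustrative only): 2 topics, 3 terms, 2 documents.
good = dict(
    topic_term_dists=[[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]],
    doc_topic_dists=[[0.75, 0.25], [0.5, 0.5]],
    doc_lengths=[5, 4],
    vocab=["data", "model", "topic"],
    term_frequency=[4, 2, 3],
)
print(check_vis_inputs(**good))
```

Running such a check on the output of `model.to_py_lda_vis()` might help narrow down which input is inconsistent, under the assumptions stated above.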