dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Storing Doc Topic Distribution with LDA model #255

Open mmantyla opened 6 years ago

mmantyla commented 6 years ago

This is mostly an annoyance. I think it would be logical for lda_model to also store the resulting doc_topic_distr as one of its public fields.

doc_topic_distr = lda_model$fit_transform(x = dtm, n_iter = 1000, convergence_tol = 0.001, n_check_convergence = 25, progressbar = FALSE)

We can see that topic_word_distribution is already there, so having doc_topic_distribution would make sense as well. Or have I misunderstood something?

```
> lda_model
Inherits from:
  Public:
    clone: function (deep = FALSE)
    components: active binding
    fit_transform: function (x, n_iter = 1000, convergence_tol = 0.001, n_check_convergence = 10,
    get_top_words: function (n = 10, topic_number = 1L:private$n_topics, lambda = 1)
    initialize: function (n_topics = 10L, doc_topic_prior = 50/n_topics, topic_word_prior = 1/n_topics)
    plot: function (lambda.step = 0.1, reorder.topics = FALSE, doc_len = private$doc_len,
    topic_word_distribution: active binding
    transform: function (x, n_iter = 1000, convergence_tol = 0.001, n_check_convergence = 5,
```

dselivanov commented 6 years ago

topic_word_distribution can be considered "fixed" after the model is fitted. doc_topic_distr, however, depends on the input data and will differ during inference.
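
To illustrate the distinction, here is a minimal sketch, assuming the `movie_review` dataset bundled with text2vec and arbitrary illustrative settings (200 training documents, 50 new documents, 5 topics, 100 iterations). It fits the model once and then runs inference on unseen documents: each call returns its own document-topic matrix, one row per input document, while the model keeps a single topic-word matrix.

```r
library(text2vec)
data("movie_review")

# Illustrative split; the sizes are arbitrary assumptions.
train_txt <- movie_review$review[1:200]
new_txt   <- movie_review$review[201:250]

# Build a vocabulary on the training texts only.
it_train <- itoken(train_txt, preprocessor = tolower,
                   tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- prune_vocabulary(create_vocabulary(it_train), term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab)

# Both document-term matrices share the same vocabulary (columns).
dtm_train <- create_dtm(it_train, vectorizer)
it_new <- itoken(new_txt, preprocessor = tolower,
                 tokenizer = word_tokenizer, progressbar = FALSE)
dtm_new <- create_dtm(it_new, vectorizer)

lda_model <- LDA$new(n_topics = 5)

# Fitting returns the document-topic matrix for the training documents.
doc_topic_train <- lda_model$fit_transform(dtm_train, n_iter = 100,
                                           progressbar = FALSE)

# Inference returns a different matrix, tied to the new input.
doc_topic_new <- lda_model$transform(dtm_new, n_iter = 100)

dim(doc_topic_train)                    # one row per training document
dim(doc_topic_new)                      # one row per new document
dim(lda_model$topic_word_distribution)  # one row per topic
```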

mmantyla commented 6 years ago

Sure. In my course, each model is run only once and then saved for further analysis. Several models are fitted on different data sets, but each with a single run. Currently, saving each of them requires saving two separate objects. With the topicmodels package, saving one model object was enough.
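
One possible workaround, sketched here under assumptions rather than as a feature of text2vec, is to wrap the model and its doc_topic_distr in a plain list and save that list as a single file with base R's `saveRDS()`. The `movie_review` data, the `bundle` list, and its element names are illustrative choices, not part of the package.

```r
library(text2vec)
data("movie_review")

# Small illustrative corpus; sizes and settings are arbitrary assumptions.
it <- itoken(movie_review$review[1:200], preprocessor = tolower,
             tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
dtm <- create_dtm(it, vocab_vectorizer(vocab))

lda_model <- LDA$new(n_topics = 5)
doc_topic_distr <- lda_model$fit_transform(dtm, n_iter = 100,
                                           progressbar = FALSE)

# Bundle both objects so that one file holds everything for this run.
# "bundle" and its element names are hypothetical, chosen for this sketch.
bundle <- list(model = lda_model, doc_topic_distr = doc_topic_distr)

path <- tempfile(fileext = ".rds")
saveRDS(bundle, path)

# Later: a single read restores the model and the matrix together.
restored <- readRDS(path)
dim(restored$doc_topic_distr)
```

This keeps one file per data set, at the cost of the file growing with the number of documents.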

dselivanov commented 6 years ago

I will make it optional. Right now we store it internally anyway (but this is not desirable, because the serialized model is huge).