ericproffitt / TopicModelsVB.jl

A Julia package for variational Bayesian topic modeling.

Train the same model on the new data #31

Open ValeriiBaidin opened 4 years ago

ValeriiBaidin commented 4 years ago

Is it possible to add new data, replace data, and retrain the model?

I tried to change the corpus of the model, along with the constants M, N, and C, but I have a problem changing model.Elogtheta.

Or, to ask it another way: how can I take initial values from another model?

P.S. I need to reproduce some kind of minibatch training.

thank you so much.

ericproffitt commented 4 years ago

This type of functionality may be quite difficult to hack into the model.

Could you describe in more detail exactly what you are trying to accomplish?

Depending on what your goal is, adding functionality for it may be fairly straightforward.

Are you using minibatches so that the global parameters (e.g. alpha and beta for LDA) of the model are updated more frequently?

Changing the corpus of a model is not advisable, as the shape and ordering of the model parameters depend on the corpus vocabulary and both document order and structure. The better way would be for me to add functionality internal to the model so that you don't need to alter the corpus to achieve your goal.

For example, if you wanted to create a new LDA model and, prior to training it, initialize it with the global parameters of an old LDA model, you could do the following,

lda_model_new.alpha = lda_model_old.alpha
lda_model_new.beta = lda_model_old.beta

Critically, this will depend on both your old and new models having the same corpus vocabulary and number of topics.
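To make that concrete, here is a minimal end-to-end sketch. The names corp_old, corp_new, and K are placeholders, corp_new is assumed to share corp_old's vocabulary, and the train! calls stand in for however you normally train,

using TopicModelsVB

K = 10  # number of topics, the same for both models

# corp_old and corp_new are hypothetical Corpus objects built against the same vocabulary.
lda_model_old = LDA(corp_old, K)
train!(lda_model_old)  # pass your usual keyword arguments here

lda_model_new = LDA(corp_new, K)
lda_model_new.alpha = lda_model_old.alpha
lda_model_new.beta = lda_model_old.beta

train!(lda_model_new)  # training now starts from the old model's global parameters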

However, like I said above, this may not be sufficient for your goals, and it would be better to handle such things internally.

ValeriiBaidin commented 4 years ago

Thank you for the quick response (as usual). I am sorry to bother you.

The dataset is too large to be estimated in one step. I don't have enough memory.

So I would like to run the model many times on random subsets, so it would be some kind of bootstrap.

Yes, the number of topics and the vocabulary are the same.

I was trying your example, except I forgot about old_alpha. I will try it.

thank you so much

ericproffitt commented 4 years ago

Your corpus must be quite large, as the LDA model has been memory-optimized.

Out of personal curiosity, how many documents are in your corpus?

As for your specific problem, if it's the model that is too large for memory, then you may consider trying a loop like this,

alpha = ones(K);
beta = ones(K, size(corp)[2]) / size(corp)[2];  # uniform initialization over the vocabulary

batch_indices_partition = [1:2500, 2501:5000];

for indices in batch_indices_partition
    # carry the global parameters over from one batch to the next
    global alpha = alpha
    global beta = beta

    # copy the corpus and keep only this batch's documents
    minicorp = copy(corp)
    minicorp.docs = minicorp[indices]

    model = LDA(minicorp, K)
    model.alpha = alpha
    model.beta = beta

    train!(model, kwargs...)  # kwargs stands for your usual training keyword arguments

    alpha = model.alpha
    beta = model.beta
end
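Since you mentioned wanting random subsets (a kind of bootstrap), one small variation, just a sketch with hypothetical n_docs and batch_size, would be to shuffle the document indices before forming the batches,

using Random

n_docs = 5000      # hypothetical total number of documents
batch_size = 2500  # assumed to divide n_docs evenly in this simple sketch

shuffled = shuffle(1:n_docs)
batch_indices_partition = [shuffled[i:i+batch_size-1] for i in 1:batch_size:n_docs]

Each element is then a random, non-overlapping block of document indices, and the loop above should work with these unchanged, assuming the corpus can be indexed with a vector of indices the same way it can with a range.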

However, if it's the corpus that is too large, then you will need to include some code to read in portions of the corpus at a time. Furthermore, this will only loop over your corpus once; if you loop over it again, the local parameter data will be reset.

Since you can't even load your full corpus or model into memory, this will not be easy for me to implement, as it will likely require streaming data from the disk while the model is running.
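That said, purely as a sketch and not something the package provides: if you can pre-split your corpus into per-batch Corpus files (all built against the same shared vocabulary) and save them with Julia's standard Serialization library, you could deserialize one batch per iteration so that only that batch is in memory while training. The file names, K, and V below are hypothetical,

using Serialization
using TopicModelsVB

K = 10                                                  # number of topics
V = 10000                                               # shared vocabulary size
batch_files = ["corp_batch_1.jls", "corp_batch_2.jls"]  # hypothetical pre-saved Corpus batches

alpha = ones(K);
beta = ones(K, V) / V;

for file in batch_files
    global alpha = alpha
    global beta = beta

    minicorp = deserialize(file)  # only one batch Corpus is in memory at a time

    model = LDA(minicorp, K)
    model.alpha = alpha
    model.beta = beta

    train!(model)  # pass your usual keyword arguments here

    alpha = model.alpha
    beta = model.beta
end

The constraint from before still applies: every batch has to be built against the same vocabulary, otherwise the columns of beta won't line up between batches.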

ValeriiBaidin commented 4 years ago

Thank you, I will do some experiments.

I have around 1 million docs.