Open ValeriiBaidin opened 4 years ago
This type of functionality may be quite difficult to hack into the model.
Could you describe in more detail exactly what you are trying to accomplish?
Depending on what your goal is, adding functionality for it may be fairly straightforward.
Are you using minibatches so that the global parameters (e.g. alpha and beta for LDA) of the model are updated more frequently?
Changing the corpus of a model is not advisable, as the shape and ordering of the model parameters depend on the corpus vocabulary and both document order and structure. The better way would be for me to add functionality internal to the model so that you don't need to alter the corpus to achieve your goal.
For example, if you wanted to create a new LDA model and, prior to training it, initialize it with the global parameters of an old LDA model, you could do the following:
lda_model_new.alpha = lda_model_old.alpha
lda_model_new.beta = lda_model_old.beta
Critically, this depends on both your old and new models having the same corpus vocabulary and number of topics.
However, as I said above, this may not be sufficient for your goals, and it would be better to handle such things internally.
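As a minimal sketch of the compatibility requirement above, the parameter shapes can be checked before anything is copied. `ToyModel` and `warmstart!` are hypothetical names used only for illustration; they are not part of the library, and the real model type has more fields. The sketch also copies the arrays, since plain assignment in Julia makes both models share the same array.

```julia
# Hypothetical stand-in for a model's global parameters (not the library's type).
mutable struct ToyModel
    alpha::Vector{Float64}   # length K (number of topics)
    beta::Matrix{Float64}    # K x V (topics by vocabulary size)
end

# Copy global parameters from `prev` into `fresh`, refusing mismatched shapes.
function warmstart!(fresh::ToyModel, prev::ToyModel)
    size(fresh.alpha) == size(prev.alpha) ||
        throw(DimensionMismatch("number of topics differs"))
    size(fresh.beta) == size(prev.beta) ||
        throw(DimensionMismatch("number of topics or vocabulary size differs"))
    fresh.alpha = copy(prev.alpha)   # copy so the two models do not share arrays
    fresh.beta = copy(prev.beta)
    return fresh
end

K, V = 3, 5
prev = ToyModel(fill(0.5, K), fill(1 / V, K, V))
fresh = ToyModel(ones(K), ones(K, V) / V)
warmstart!(fresh, prev)
```

If the shapes differ, the call fails early with a `DimensionMismatch` instead of silently producing a model whose parameters do not line up with its corpus.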
Thank you for the quick response (as usual). I am sorry to bother you.
The dataset is too large to be estimated in one step. I don't have enough memory.
So I would like to run the model many times, each time on a random subset. It would be some kind of bootstrap.
Yes, the number of topics and the vocabulary are the same.
I was trying your example, except I forgot about old_alpha. I will try it.
thank you so much
Your corpus must be quite large, as the LDA model has been memory optimized.
Out of personal curiosity, how many documents are in your corpus?
As for your specific problem, if it's the model that is too large for memory, then you may consider trying a loop like this,
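# Initial global parameters: K topics, size(corp)[2] vocabulary terms.
alpha = ones(K)
beta = ones(K, size(corp)[2]) / size(corp)[2]
batch_indices_partition = [1:2500, 2501:5000]
for indices in batch_indices_partition
    # Needed so the assignments at the end of the loop body update the top-level variables.
    global alpha, beta
    # Build a corpus holding only this batch of documents.
    minicorp = copy(corp)
    minicorp.docs = minicorp[indices]
    # Start the batch model from the parameters learned so far.
    model = LDA(minicorp, K)
    model.alpha = alpha
    model.beta = beta
    train!(model, kwargs...)
    # Carry the updated global parameters over to the next batch.
    alpha = model.alpha
    beta = model.beta
end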
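The hard-coded `batch_indices_partition` above could be generated for any corpus size. The following is a sketch using only Julia's standard library; `batch_ranges` and `random_batches` are hypothetical helper names, and the second one covers the random-subset variant mentioned earlier in the thread (disjoint random subsets, which is not a bootstrap in the strict with-replacement sense).

```julia
using Random

# Consecutive index ranges of at most `batchsize` documents each.
batch_ranges(M::Integer, batchsize::Integer) =
    collect(Iterators.partition(1:M, batchsize))

# Disjoint random batches: shuffle the indices 1:M, then cut them into batches.
function random_batches(rng::AbstractRNG, M::Integer, batchsize::Integer)
    perm = randperm(rng, M)
    return [collect(b) for b in Iterators.partition(perm, batchsize)]
end

ranges = batch_ranges(5000, 2500)                     # same partition as in the loop above
batches = random_batches(MersenneTwister(1), 10, 4)   # batch sizes 4, 4, 2
```

Either result can be iterated in place of `batch_indices_partition`, assuming the corpus accepts a vector of indices as well as a range.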
However, if it's the corpus that is too large for memory, then you will need to include some code to read portions of the corpus at a time. Furthermore, this will only loop over your corpus once, and if you loop over it again, the local parameter data will be reset.
Since you can't even load your full corpus or model into memory, this will not be easy for me to implement, since it will likely require streaming data from the disk while the model is running.
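Reading portions of the corpus at a time could look roughly like the sketch below. It assumes a hypothetical on-disk format of one document per line in a plain text file, which is not necessarily how the library stores corpora, and `foreach_chunk` is an illustrative helper rather than a library function. Only one chunk of lines is held in memory at a time.

```julia
# Call `f` on successive vectors of at most `chunksize` lines from `path`.
function foreach_chunk(f, path::AbstractString, chunksize::Integer)
    open(path) do io
        for chunk in Iterators.partition(eachline(io), chunksize)
            f(chunk)   # `chunk` is a Vector{String} of raw document lines
        end
    end
end

# Usage with a small temporary file standing in for a large corpus file.
path, io = mktemp()
write(io, "doc one\ndoc two\ndoc three\ndoc four\ndoc five\n")
close(io)

sizes = Int[]
foreach_chunk(path, 2) do chunk
    push!(sizes, length(chunk))   # a real run would build and train a mini-corpus here
end
```

Each chunk would still have to be converted into a corpus object that shares the full vocabulary, so that `beta` keeps the same shape across batches.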
Thank you, I will do some experiments.
I have around 1 million documents.
Is it possible to add new data, replace data, and retrain the model?
I tried to change the corpus of the model and the constants M, N, C, but I have a problem changing model.Elogtheta.
Or I can ask in another way: how can I take initial values from another model?
P.S. I need to reproduce some kind of minibatch training.
Thank you so much.