Dealing with multiple model instances using EMWithVE

Kirluu commented 6 years ago

Hi :)

We are doing our master's thesis at the IT University of Copenhagen, and we have a series of questions that we hope you can help us answer :)

We are working with a setup very similar to the spam-filter application case from chapter 3 in the "Practical Probabilistic Programming" book, and our questions regard the efficiency of learning for such a model. In essence, we have several model instances, which all reference some "shared" Beta-elements for learning, which in effect results in quite a large net of connected elements. We would like to learn our Beta-elements without having to evaluate the entire network of connected elements at once, and instead train on each individual model instance one at a time.

Here are some more specific questions:

Why does EMWithVE use a completely different setup (ExpectationMaximizationWithFactors) compared to the other inference algorithms when used with EM? What are the optimizations / differences that apply here, and is there some literature that you could point us to that would help us understand some of the differences?
If we attempt to use GeneralizedEM with VE, it seems that all active elements in the universe (and thereby all our connected model instances) are passed as inputs to the inference algorithm. As the number of model instances increases, this quickly becomes infeasible for an algorithm such as VE. If we consider the spam filter case from Chapter 3, would it not be possible to use the inference algorithm on each sample separately and then combine the results during the expectation step, rather than attempting to calculate the sufficient statistics for all model instances' elements at once? We figured that this splitting approach might be feasible with VE (if each individual model instance is not very complex), and that it would have the added benefit of being parallelizable (since each sample can be reasoned about separately) if we can use StructuredVE for the task. Is there a reason why this approach is not used? Is it not feasible? If it is possible, could you provide some pointers for how we can achieve this goal?
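The splitting approach asked about in the second question can be sketched in plain Python. This is a minimal illustration, not Figaro code: it assumes a two-coin mixture in the style of the EM tutorial linked in this thread, with a uniform prior over the hidden coin, and it does maximum-likelihood EM without a Beta prior. The names `posterior_coin_a`, `instance_stats` and `em_step` are made up for the example. Each instance is inferred on its own, and only the expected sufficient statistics are summed before the M-step.

```python
def posterior_coin_a(heads, tails, theta_a, theta_b):
    """P(coin = A | flips of one instance), assuming a uniform prior over coins."""
    like_a = theta_a ** heads * (1 - theta_a) ** tails
    like_b = theta_b ** heads * (1 - theta_b) ** tails
    return like_a / (like_a + like_b)


def instance_stats(heads, tails, theta_a, theta_b):
    """E-step for ONE instance: expected (heads_A, tails_A, heads_B, tails_B)."""
    w = posterior_coin_a(heads, tails, theta_a, theta_b)
    return (w * heads, w * tails, (1 - w) * heads, (1 - w) * tails)


def em_step(instances, theta_a, theta_b):
    """One EM iteration: per-instance E-steps, summed, then a single M-step."""
    totals = [0.0, 0.0, 0.0, 0.0]
    for heads, tails in instances:
        # Each instance only needs the current parameters, so this loop
        # could be run in parallel and the results added up afterwards.
        for i, s in enumerate(instance_stats(heads, tails, theta_a, theta_b)):
            totals[i] += s
    new_a = totals[0] / (totals[0] + totals[1])
    new_b = totals[2] / (totals[2] + totals[3])
    return new_a, new_b


# (heads, tails) per instance; the coin used in each instance is hidden.
instances = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
theta_a, theta_b = em_step(instances, 0.6, 0.5)
print(round(theta_a, 2), round(theta_b, 2))  # 0.71 0.58
```

Because the instances are conditionally independent given the shared parameters, summing per-instance statistics gives the same E-step result as running inference over the whole connected network. Whether Figaro's EM classes can be driven this way is not something this sketch shows.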

To give some perspective, we are trying to optimize the training setup for our thesis work, so that it takes as little time as possible to see the effect of an alteration to the probabilistic model, both in regards to training and of course evaluation. The setup with our model instances getting tangled into each other due to the shared Beta-elements seems to hurt the efficiency of most inference algorithms in Figaro that are usable with EM. Is there some other approach that we could go with as an alternate setup?

As another note, we believe that we are able to build our model in such a fashion that we should have little to no hidden variables (namely 0 in the learning case, and only a single one in the evaluation phase), which should help the efficiency of whatever inference algorithm we end up with. Also, according to the literature (https://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf), if one has no hidden variables, then one is in fact in the "complete data case", meaning that Maximum Likelihood Estimation should be feasible for the problem, namely simply learning the frequencies in one's dataset, rather than requiring the use of EM. Is there some way to access the MLE logic that is used as part of the EM algorithm from somewhere in the source code?
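For the complete-data case described above, the estimate reduces to counting. The sketch below is plain Python rather than Figaro's internal logic, and the function names are hypothetical. It assumes a single Bernoulli outcome governed by a Beta(alpha, beta) prior and uses the standard Beta-Bernoulli conjugate update.

```python
def mle(successes, trials):
    """Maximum likelihood estimate: the observed frequency."""
    return successes / trials


def beta_posterior(alpha, beta, successes, trials):
    """Conjugate update: Beta(alpha, beta) prior plus fully observed counts."""
    return alpha + successes, beta + (trials - successes)


def map_estimate(alpha, beta, successes, trials):
    """Mode of the Beta posterior; only defined here when both shapes exceed 1."""
    a, b = beta_posterior(alpha, beta, successes, trials)
    if a <= 1 or b <= 1:
        raise ValueError("posterior mode requires both shape parameters > 1")
    return (a - 1) / (a + b - 2)


# 3 successes in 4 fully observed instances, with a Beta(2, 2) prior.
print(mle(3, 4))                 # 0.75
print(beta_posterior(2, 2, 3, 4))  # (5, 3)
```

With no hidden variables the expected sufficient statistics equal the observed counts, so a single pass over the data is enough and no iteration is needed.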

Thanks a lot, Hoping the best, Best regards, Christian and Kenneth, Students at the IT University of Copenhagen, Denmark

apfeffer commented 6 years ago

Hi Christian and Kenneth,

I believe the reason EMWithVE has a different setup from the other algorithms is because those other algorithms are generalized for any inference algorithm that computes the distribution over the parameters, whereas EMWithVE uses the factors directly to compute the expected sufficient statistics. In other words, computing sufficient statistics is baked into VE, but not the other algorithms, so they need a different setup.

I share your concern about all the active elements being inputs to VE in every iteration. I recommend that you consider using OnlineEM, since that processes only one inference at a time. This is how we deal with large datasets.

We have not tried using Structured VE with EM. It is possible it will work fine, however this is unknown territory.

I appreciate your interest. Please let me know if this helps and if you have any other questions.

Thanks,

Avi

In reply to: Kirluu, Wednesday, April 11, 2018 at 8:06 AM, [p2t2/figaro] Dealing with multiple model instances using EMWithVE (#740)


Kirluu commented 6 years ago

Thank you for your reply, @apfeffer

Indeed, EMWithVE Online seems to do the trick for us, which is of course very favorable for us at this moment.

However, we took the liberty of hardcoding the seeds for randomness for the EM setup in our local clone of the Figaro repository, and it appears that running EMWithVE versus the Online version produces quite different results. We suspect this is due to the same fact that caused our original concern, namely that the models being computed on are different. (For EMWithVE, we get one huge instance, and for Online, we get many smaller instances.)

What still confuses us is that the book's example on Online training, which we followed to the letter, uses the same pattern with ModelParameters. This should still deliver the exact same Beta-element to each of the instances created in the Online scenario. This must mean that there exists some logic in the Online setup that "handles" the case where learning-elements are part of a model instance.

The question then becomes: How come this is not handled similarly in the regular EMWithVE case? Given that the book presents the pattern using ModelParameters and setting up many model-instances using these shared Beta-elements and then using the EMWithVE setup, surely this case should be handled there as well?

Maybe it is computationally infeasible to determine which elements are part of which model instance. If so, an explanation would be great. Otherwise, we'd simply like to know more, to better prepare ourselves for potential questions regarding our usage of Figaro and the theory behind it.

Thank you in advance for any additional insights, Best regards, Christian and Kenneth

apfeffer commented 6 years ago

Christian and Kenneth,

I apologize, I might not be able to answer your questions here in detail, as it’s been a while since we’ve looked in detail at the EM code. Since the actual operation of the EM algorithm with VE is different, I think any differences due to different seeds might have to do with different initialization of the parameter values. It wouldn’t be surprising if EMWithVE and OnlineEM initialize these values differently. Could this be the case?

Avi

In reply to: Kirluu, Wednesday, April 18, 2018 at 8:03 AM, Re: [p2t2/figaro] Dealing with multiple model instances using EMWithVE (#740)


Kirluu commented 6 years ago

Hi again @apfeffer ,

We have decided not to directly pursue the differences that we observed in the learned parameter values, as they have no major impact on the results of our project.

However, we'd still like to hear more on the question of why Online EM is able to "cope" with the Beta (learning) elements being shared amongst data instances, while regular EMWithVE, for instance, cannot. For regular EM in Figaro, the time consumption scales poorly with added data, whereas Online EM scales linearly, as we would expect from EM in general.
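One common textbook formulation of online (stepwise) EM may help frame the question. The sketch below is plain Python and makes several assumptions: it reuses a two-coin mixture with a uniform prior over the hidden coin, it uses a 1/(k+1) step size, and it seeds the running statistics from the initial parameters. Whether Figaro's OnlineEM uses this exact update is not confirmed anywhere in this thread, and the name `online_em` is hypothetical.

```python
def posterior_coin_a(heads, tails, theta_a, theta_b):
    """P(coin = A | flips of one instance), assuming a uniform prior over coins."""
    like_a = theta_a ** heads * (1 - theta_a) ** tails
    like_b = theta_b ** heads * (1 - theta_b) ** tails
    return like_a / (like_a + like_b)


def online_em(instances, theta_a, theta_b):
    """Single pass: infer on one instance, blend its statistics, re-estimate."""
    # Running statistics (heads_A, tails_A, heads_B, tails_B), seeded from the
    # initial parameters so the first instance cannot fully overwrite them.
    stats = [theta_a, 1 - theta_a, theta_b, 1 - theta_b]
    for k, (heads, tails) in enumerate(instances, start=1):
        # Inference touches only this one instance, so the cost per update
        # does not grow with the size of the dataset.
        w = posterior_coin_a(heads, tails, theta_a, theta_b)
        new = (w * heads, w * tails, (1 - w) * heads, (1 - w) * tails)
        eta = 1.0 / (k + 1)  # assumed step-size schedule
        stats = [(1 - eta) * s + eta * n for s, n in zip(stats, new)]
        # The parameters change before the next instance is seen, which is
        # one reason results can differ from batch EM on the same data.
        theta_a = stats[0] / (stats[0] + stats[1])
        theta_b = stats[2] / (stats[2] + stats[3])
    return theta_a, theta_b


instances = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
theta_a, theta_b = online_em(instances, 0.6, 0.5)
print(theta_a, theta_b)
```

Under this formulation the shared parameters are just numbers carried from one instance to the next, so no instance is ever connected to another inside a single inference call. The result also depends on the order of the data and on the step-size schedule, so identical seeds would not be expected to make it match batch EM.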

Is there some reasoning behind why the handling performed in Online is not possible for the OnePropQuery approach of the regular EM setup?

Thank you once again, Regards, Christian and Kenneth
