charles-river-analytics / figaro

Figaro Programming Language and Core Libraries

Figaro on Spark #347

Closed tcfuji closed 8 years ago

tcfuji commented 9 years ago

Hi,

Would the Figaro project benefit from a scalable, distributed MCMC implementation? Spark seems to be a natural candidate for this.

This kind of work has been done using PySpark on pymc: http://blog.cloudera.com/blog/2014/08/bayesian-machine-learning-on-apache-spark/

apfeffer commented 9 years ago

Hi Ted,

This would definitely be interesting to us, but we currently have no resources or expertise to do this. Is this something you would be interested in? I’d love to talk to you if you have ideas.

Thanks,

Avi

motownblue commented 9 years ago

We do expect to be looking at this in a couple of months as part of a new project that will be launching at that time. I would also be interested in hearing your ideas about this.

tcfuji commented 9 years ago

@apfeffer Sure! Still learning Figaro but I can help with the Spark side of things.

@motownblue At this point, my vague idea on how to proceed would be to first create an RDD-like Element class. RDDs are the main data structure in Spark, and combining them with Figaro's Element seems to be essential.

bruttenberg commented 9 years ago

I have some experience with Spark and could probably help out on this.

However, given the close relationship between Spark and distributed graph frameworks (i.e., GraphX), I think a real natural integration would be to get a distributed Belief Propagation algorithm running on top of Spark (possibly using GraphX components).

Brian

gtakata commented 9 years ago

Is this to be able to use resources more efficiently or to parallelize the computations?

If the latter, remember that an MCMC sample depends on calculations based on the previous sample, so you still have to do the sampling “in sequence”, and the acceptance decision has to compare and select between the results of two samples. You don’t gain much from parallelization in this scenario since, unlike straight MC methods, the samples are not iid.
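
To make the sequential dependence concrete, here is a minimal random-walk Metropolis sketch in plain Scala. It is not Figaro code: the standard normal target, the step size, and all names are assumptions chosen for illustration.

```scala
import scala.util.Random

object MetropolisSketch {
  // Unnormalised log-density of a standard normal target (illustrative choice).
  def logTarget(x: Double): Double = -0.5 * x * x

  // One random-walk Metropolis step: both the proposal and the accept/reject
  // decision depend on the current state `x`.
  def step(x: Double, rng: Random, stepSize: Double): Double = {
    val proposal = x + stepSize * rng.nextGaussian()
    val logRatio = logTarget(proposal) - logTarget(x)
    if (math.log(rng.nextDouble()) < logRatio) proposal else x
  }

  // The chain is a left-to-right scan: sample i cannot be computed before sample i - 1.
  def chain(n: Int, start: Double, seed: Long, stepSize: Double = 1.0): Vector[Double] = {
    val rng = new Random(seed)
    Iterator.iterate(start)(x => step(x, rng, stepSize)).drop(1).take(n).toVector
  }
}
```

Because each call to `step` needs the previous state, the loop inside a single chain cannot simply be split across workers.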

What you can gain is the ability to distribute the computation to more powerful machines and to save on local storage during the computations. Note that this does not necessarily improve processing speed if the distributed platform is no more powerful than the local one. It may even slow things down as network traffic comes into play.

One thing to think about for all our algorithms is whether the inference calculations can be sequenced so that parts can be executed in parallel. A simple example would be something like A depends on (B, C), where B and C do not depend on A (i.e., no loop). Then B and C could be calculated independently and applied to A when both calculations are done. This is a standard use of Futures. Spark, Akka, and other tools would help speed up processing in this instance, but that requires a lot more analysis on our part.
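
The A depends on (B, C) case can be sketched with the standard library's `scala.concurrent.Future`. The three `compute*` functions are hypothetical stand-ins for real inference steps, not anything from Figaro.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureSketch {
  // Hypothetical stand-ins for two independent sub-computations.
  def computeB(): Double = (1 to 1000).map(i => 1.0 / i).sum
  def computeC(): Double = (1 to 1000).map(i => 1.0 / (i.toDouble * i)).sum

  // A depends on both results, so it is applied only once both are available.
  def computeA(b: Double, c: Double): Double = b * c

  def run(): Double = {
    // Start both futures before combining them so that they can run concurrently.
    val fb = Future(computeB())
    val fc = Future(computeC())
    val fa = for { b <- fb; c <- fc } yield computeA(b, c)
    Await.result(fa, 10.seconds)
  }
}
```

Creating `fb` and `fc` before the `for` comprehension matters: writing `Future(computeB())` inside the comprehension would start C only after B had finished.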

Glenn


nightscape commented 9 years ago

It might be worth having a look at emcee (Python) which seems to be capable of multi threading: http://dan.iel.fm/emcee/current/user/advanced/#multiprocessing

bruttenberg commented 9 years ago

With MCMC, you can still run N parallel versions of the chain and compile the samples at the end.

The real impediment to getting parallel MCMC to run on Figaro is that Figaro models are not stateless. That is, we persist the values of each element in a model inside the model class. This presents a big problem for parallel MCMC (and importance sampling, for that matter). You have to either run N MCMC algorithms on N copies of your model, or remove the state from the elements and have MCMC maintain N states of the model. Both are valid options but aren’t trivial to implement.

This is also why I suggested BP as a good vehicle for parallel computation.
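
As a rough sketch of the "N parallel chains" idea, the following plain-Scala example runs one independent random-walk Metropolis chain per seed and pools the samples at the end. It corresponds loosely to the "N copies of the model" option: each chain owns its own state. The target density and all names are assumptions, and no Figaro classes are involved.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Random

object ParallelChainsSketch {
  // Unnormalised log-density of a standard normal target (illustrative choice).
  def logTarget(x: Double): Double = -0.5 * x * x

  // Each chain owns its own state and RNG; nothing is shared between chains.
  def runChain(seed: Long, n: Int): Vector[Double] = {
    val rng = new Random(seed)
    var x = 0.0
    Vector.fill(n) {
      val proposal = x + rng.nextGaussian()
      if (math.log(rng.nextDouble()) < logTarget(proposal) - logTarget(x)) x = proposal
      x
    }
  }

  // Run one chain per seed concurrently and pool the samples at the end.
  def pooled(seeds: Seq[Long], n: Int): Vector[Double] = {
    val chains = Future.sequence(seeds.map(seed => Future(runChain(seed, n))))
    Await.result(chains, 1.minute).flatten.toVector
  }
}
```

Within a chain the samples are still produced in sequence; the speedup comes only from running several chains side by side.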

Brian

apfeffer commented 9 years ago

We have already discussed removing the state from Figaro elements and replacing them with views. We had first considered it as a method of improving the code base and making it more functional, but if there’s a real need for it perhaps now is the time to do it.

gtakata commented 9 years ago

Isn’t the point of these Big Data technologies that you can have each server maintain its own state? If so, each server could run its own version of the model and avoid the statefulness problem (assuming that they really are parallel).


zero323 commented 8 years ago

@tcfuji As far as I am concerned, the first step, before you even start with RDDs, has to be a serializable Element implementation.
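
To illustrate what "serializable" means in practice, here is a small round trip through Java serialization, the mechanism Spark uses by default to ship objects between JVMs. `FlipElement` is a hypothetical, heavily simplified stand-in and not Figaro's actual `Element` class.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SerializationSketch {
  // Hypothetical, simplified element: immutable data only, no reference to a universe.
  // Case classes are serializable as long as all of their fields are.
  final case class FlipElement(name: String, prob: Double)

  // Write the value to a byte array and read it back, as would happen
  // when an object is sent from one JVM to another.
  def roundTrip[A <: java.io.Serializable](value: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}
```

An element that held a reference to a mutable, non-serializable container would fail at `writeObject` with a `NotSerializableException`, which is why this step comes first.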

apfeffer commented 8 years ago

Possibly subsumed by other projects

SemanticBeeng commented 7 years ago

@apfeffer - I assume you closed this because it was classified as "future work". Do you see a case for a review of the implementation (towards a more functional one) so that the road is paved for porting to Spark?

There are a few other interesting things that could be leveraged: https://devblogs.nvidia.com/parallelforall/gpus-accelerate-epidemic-forecasting/

“Without CUDA technology, the MCMC is simply too slow to be of practical use during a disease outbreak,” he says. “With the 380x speedup over a single core non-vector CPU code, real-time forecasting is now a reality!”

apfeffer commented 7 years ago

Hi,

I'm still very interested in seeing Figaro on Spark. Unfortunately we don't have the resources to do this right now, but I would still like to see it happen somehow.

Avi

SemanticBeeng commented 7 years ago

If at all possible to start a public project to migrate first to a more functional approach it would open the road for others.

I looked at this before but it being stateful was a roadblock in the sense that it would be a big effort and hard to tell how and what you would welcome and what not in this re-write.

Maybe you could start with your requirements and motivation for this distributed implementation?

apfeffer commented 7 years ago

Hi,

I'm sorry about the slow response. We've had the goal of making Figaro stateless for a long time. However, it's not trivial to implement. Unfortunately, it's always fallen behind developing new functionality. Right now, we really don't have resources to do it ourselves. We'd welcome a public effort to make Figaro functional.

We're hoping to release Figaro 5.0 in August or so, after which would be a good time to take on a fundamental task like this.

A question I have for you: Do you believe a stateless implementation would be necessary for implementing Figaro on Spark? If so, please explain why.

Thanks,

Avi

SemanticBeeng commented 7 years ago

"Do you believe a stateless implementation would be necessary for implementing Figaro on Spark?"

Not a stateless one (that may even be impossible), but a functional one, in the sense that mutable state is managed visibly and explicitly, as opposed to implicitly as in OO, and state mutations are isolated into explicit effects.

The intuitive motivation for applying this powerful technique from functional programming is to ease distributed computation, both on a large cluster (Spark) and on a small one (multi-GPU).

When crossing JVM / worker / GPU boundaries, such an implementation may be necessary so that the computation and its effects can be used without becoming bound to a location.

Whether this intuition applies to Figaro and its use cases would have to be validated by writing thorough specs for how we would want to use it in Spark.

The authority on how to use effects and how they are implemented in Scala is @etorreborre. See more about this in https://github.com/atnos-org/eff.

If this makes enough sense, maybe once Figaro 5.0 is out we can write these specs so that you can see it used in Spark and re-evaluate?

etorreborre commented 7 years ago

There is previous work on probabilistic programming with monads. I would probably start there.

SemanticBeeng commented 7 years ago

Of course you would reference a Haskell paper, @etorreborre - the proper and best way to start. :+1: Does https://github.com/atnos-org/eff have enough support to port this Haskell code to Scala?

The paper mentions MCMC, which is used in Figaro.

I will re-mention here what I said earlier because it makes the case for using GPUs https://devblogs.nvidia.com/parallelforall/gpus-accelerate-epidemic-forecasting/

“Without CUDA technology, the MCMC is simply too slow to be of practical use during a disease outbreak,” he says. “With the 380x speedup over a single core non-vector CPU code, real-time forecasting is now a reality!”

etorreborre commented 7 years ago

@SemanticBeeng I have no clue because I never did probabilistic programming but it might be possible yes.

apfeffer commented 7 years ago

Hi,

This is very interesting.

Let me describe the ways Figaro is currently stateful and then we can think about what to do about it.

There are two main stateful classes: Element and Universe. An Element represents a random variable; its state consists of (1) any conditions and constraints that have been added to the element; (2) its current value under a sampling process. I believe that #2 should really be moved out of Element and made a view that is part of an inference algorithm that uses those values, rather than an intrinsic property of the element. #1 does not seem as worrisome, as this is "static" and should not change at all during inference.
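
A minimal sketch of moving the "current value" out of the element and into a view owned by the sampler might look as follows. `Flip`, `Values`, and `sample` are hypothetical simplifications for illustration and do not mirror Figaro's real classes.

```scala
import scala.util.Random

object ValueViewSketch {
  // Hypothetical, simplified element: a definition only, with no "current value" field.
  final case class Flip(name: String, prob: Double)

  // The sampler's view of the model: element name -> value in the current sample.
  type Values = Map[String, Boolean]

  // Forward sampling returns a fresh view instead of mutating the elements.
  def sample(model: Seq[Flip], rng: Random): Values =
    model.map(e => e.name -> (rng.nextDouble() < e.prob)).toMap
}
```

With this shape, two samplers can hold two different `Values` maps over the same `model` without interfering with each other, which is the property parallel sampling needs.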

Universe is where I think the difficulty lies. A Universe contains a set of elements as well as information about the dependencies between the elements. During execution, elements get added to and removed from a universe. I'm having a hard time imagining how this could be done in a functional way, but this is probably my lack of imagination and experience with functional programming.

I'd appreciate any thoughts you have.

Thanks,

Avi

bruttenberg commented 7 years ago

Avi, why couldn’t you have a view on Universes? The “value” of a universe is really just the set of elements and dependencies that exists in the universe. Not saying it would be easy to implement, but conceptually it could work.
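
Conceptually, treating the "value" of a universe as plain data could look like the sketch below: an immutable snapshot of elements and dependencies, where adding or removing an element yields a new snapshot. The `Universe` case class here is a hypothetical simplification (names and edges only), not Figaro's `Universe`.

```scala
object UniverseViewSketch {
  // Hypothetical, simplified universe: element names plus "uses" edges.
  final case class Universe(elements: Set[String], uses: Map[String, Set[String]]) {
    // Adding an element returns a new universe; the old one is untouched.
    def add(name: String, args: Set[String] = Set.empty): Universe = {
      require(args.subsetOf(elements), s"unknown arguments: ${args -- elements}")
      copy(elements = elements + name, uses = uses + (name -> args))
    }

    // Removing an element also drops every edge that mentions it.
    def remove(name: String): Universe =
      Universe(elements - name, (uses - name).map { case (k, v) => k -> (v - name) })
  }

  val empty: Universe = Universe(Set.empty, Map.empty)
}
```

An algorithm would then carry its own current snapshot, so several algorithms could evolve different views from the same starting universe.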

etorreborre commented 7 years ago

Hi @apfeffer, the "classical" FP way to deal with Universe is to make it immutable and pass it from function to function, using it as a "Monad" context. So you can write something like this:

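// Sketch: assumes a State monad with `get` and `put` in scope (for example
// scalaz, via `import scalaz._, Scalaz._`) and a hypothetical immutable
// Universe with a `setParameter` method that returns a new Universe.
def myFunctionWithUniverse(p1: Int, p2: Int): State[Universe, Int] =
  for {
    // get the current universe if you need it
    universe <- get[Universe]
    // replace the current universe with a modified copy
    _        <- put(universe.setParameter(p1 + p2))
  } yield p1 + p2

// this function uses a stateful function but doesn't access the Universe directly
def otherFunction: State[Universe, Int] =
  for {
    i <- myFunctionWithUniverse(1, 2)
  } yield i * i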

This story becomes a bit more complicated when you have other "effects" (errors for example) or concurrency to thread in.
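
For readers without a State library at hand, the snippet above can be made self-contained with a hand-rolled State monad. This is a sketch under assumptions: the tiny `State` below is not the scalaz, cats, or eff implementation, and the one-field `Universe` is hypothetical.

```scala
object StateSketch {
  // Minimal State monad: a function from a state S to (new state, result).
  final case class State[S, A](run: S => (S, A)) {
    def map[B](f: A => B): State[S, B] =
      State((s: S) => { val (s1, a) = run(s); (s1, f(a)) })
    def flatMap[B](f: A => State[S, B]): State[S, B] =
      State((s: S) => { val (s1, a) = run(s); f(a).run(s1) })
  }
  // Read the current state as the result.
  def get[S]: State[S, S] = State((s: S) => (s, s))
  // Replace the current state.
  def put[S](s: S): State[S, Unit] = State((_: S) => (s, ()))

  // Hypothetical immutable universe with a single parameter.
  final case class Universe(parameter: Int) {
    def setParameter(p: Int): Universe = copy(parameter = p)
  }

  def myFunctionWithUniverse(p1: Int, p2: Int): State[Universe, Int] =
    for {
      universe <- get[Universe]
      _        <- put(universe.setParameter(p1 + p2))
    } yield p1 + p2

  def otherFunction: State[Universe, Int] =
    for { i <- myFunctionWithUniverse(1, 2) } yield i * i
}
```

Nothing happens until the program is run against an initial universe: `StateSketch.otherFunction.run(StateSketch.Universe(0))` returns the final universe together with the result, and the initial universe is left unchanged.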

SemanticBeeng commented 7 years ago

@apfeffer I am somewhat familiar with the code and agree that the Universe, this global state, needs to be decomposed somehow.

Maybe we can start with use cases that you have in mind for using Figaro in Spark (distributed) where the current implementation would be an issue.

Another direction would be to make explicit the computations that have effects mutating the Universe and giving them business names so we can reason about them.
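
One way to picture "named" state changes is an event-sourcing style sketch: each mutation of the universe becomes a value with a name, and the current state is obtained by replaying the events. The event names and the set-of-names `Universe` are hypothetical and chosen only for illustration.

```scala
object UniverseEventsSketch {
  // Hypothetical named operations on a universe; the names are illustrative.
  sealed trait UniverseEvent
  final case class ElementActivated(name: String) extends UniverseEvent
  final case class ElementDeactivated(name: String) extends UniverseEvent

  // In this sketch the universe's state is just the set of active element names.
  type Universe = Set[String]

  // Pure transition function: current state + event => next state.
  def applyEvent(u: Universe, e: UniverseEvent): Universe = e match {
    case ElementActivated(n)   => u + n
    case ElementDeactivated(n) => u - n
  }

  // Replaying the log from the empty universe rebuilds the current state.
  def replay(log: Seq[UniverseEvent]): Universe =
    log.foldLeft(Set.empty[String])(applyEvent)
}
```

Because the log is plain immutable data, it can be inspected, tested, or shipped to another worker and replayed there.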

This mindset will become clearer if you would be kind enough to read about task-based UIs, event sourcing, CQRS, immutable domain models, etc.

https://cqrs.wordpress.com/documents/task-based-ui/

https://medium.com/technology-learning/event-sourcing-and-cqrs-a-look-at-kafka-e0c1b90d17d8

https://tech.zilverline.com/2011/02/01/towards-an-immutable-domain-model-introduction-part-1

https://tech.zilverline.com/2011/02/02/towards-an-immutable-domain-model-immutable-change-part-2

https://tech.zilverline.com/2011/02/05/towards-an-immutable-domain-model-immutability-achieved-part-3

https://tech.zilverline.com/2011/02/07/towards-an-immutable-domain-model-believe-the-type-part-4

https://tech.zilverline.com/2011/02/10/towards-an-immutable-domain-model-monads-part-5

I am ready to clarify if you have questions about the material in the links above.

Please skim them and advise:

  1. whether the ideas above make sense
  2. how they apply to Figaro and why (or why not, of course)

bruttenberg commented 6 years ago

Just FYI, this paper just showed up on arXiv:

https://arxiv.org/pdf/1707.02047.pdf

The authors propose a PPL on top of Spark. Seems relevant.

SemanticBeeng commented 6 years ago

(a) We present the extension of Scala’s syntax that can express various sophisticated Bayesian network models with ease. (b) We present the details of compiling and executing an InferSpark program on Spark. That includes the mechanism of automatic generating efficient inference codes that include checkpointing (to avoid long lineage), proper timing of caching and anti-caching (to improve efficiency under memory constraint), and partitioning (to avoid unnecessary replication and shuffling). (c) We present an empirical study that shows InferSpark can enable statistical inference on both customized and standard models at scale.

Sounds very nice. Hope they publish code.

One of the key parts is how they implement MCMC.

Parallelism in MCMC is hard because MCMC is inherently a serial algorithm: https://stats.stackexchange.com/questions/204326/what-makes-parallel-distributed-probabilistic-inference-difficult-to-implement
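
To make the serial dependency concrete, here is a small sketch of random-walk Metropolis in plain Scala. Each step needs the previous sample, so a single chain cannot be split across workers; the only parallelism below comes from running several independent chains with different seeds. The target density, seeds, and step size are illustrative assumptions, and this is not a description of how InferSpark or Figaro implement inference.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Random

object ParallelChains {
  // Log-density of a standard normal, up to an additive constant (illustrative target).
  def logTarget(x: Double): Double = -0.5 * x * x

  // One random-walk Metropolis chain; every step depends on the previous state.
  def runChain(seed: Long, steps: Int, stepSize: Double): Vector[Double] = {
    val rng = new Random(seed)
    var x = 0.0
    val out = Vector.newBuilder[Double]
    for (_ <- 0 until steps) {
      val proposal = x + stepSize * rng.nextGaussian()
      // Accept with probability min(1, target(proposal) / target(x)), computed in log scale.
      if (math.log(rng.nextDouble()) < logTarget(proposal) - logTarget(x)) x = proposal
      out += x
    }
    out.result()
  }

  // Run several independent chains concurrently and collect their samples.
  def runChains(numChains: Int, steps: Int): Vector[Vector[Double]] = {
    val futures = (0 until numChains).toVector.map { i =>
      Future(runChain(seed = 42L + i, steps = steps, stepSize = 1.0))
    }
    Await.result(Future.sequence(futures), 1.minute)
  }

  def main(args: Array[String]): Unit = {
    val samples = runChains(numChains = 4, steps = 20000).flatten
    val mean = samples.sum / samples.size
    val variance = samples.map(s => (s - mean) * (s - mean)).sum / samples.size
    println(f"pooled mean = $mean%.3f, pooled variance = $variance%.3f")
  }
}
```

Independent chains add samples once each chain has converged, but they do not shorten the time any single chain needs to converge. That is one reason distributing MCMC takes more than adding workers.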

I also found this interesting: https://github.com/tensorprob/tensorprob

The posterior distribution (or likelihood function) are constructed and evaluated using TensorFlow, which means you can make use of multiple CPU cores and GPUs simultaneously. This also makes it easy to add new custom probability distributions by using the symbolic operators defined in TensorFlow.

Again the GPU thing comes up.

SemanticBeeng commented 6 years ago

Found this great article studying HMM and Viterbi: https://mioalter.wordpress.com/2016/02/13/hmm-hidden-markov-models-with-figaro/

It shows how Figaro fits in from a functional programming perspective. It also gives insight into elements of probabilistic programming, such as probability distributions and belief propagation, in the context of HMMs.

https://github.com/mioalter/fp-scala/blob/master/hmm-example/code/src/main/scala/hmm/examples.scala#L24

hmm.observable(1).observe(Walk)
println("After observing Walk on day 1: " + VariableElimination.probability(hmm.hidden(2), Sunny))
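
For intuition about what a query like this computes, the sketch below works out the same kind of posterior by hand for a two-state HMM in plain Scala, without Figaro. All probabilities are made-up illustrative numbers, not those of the linked example, and treating day 1 as the first day is an assumption. Figaro's `VariableElimination` is a general algorithm and is not implemented this way.

```scala
object HmmSketch {
  sealed trait Weather
  case object Sunny extends Weather
  case object Rainy extends Weather

  sealed trait Activity
  case object Walk extends Activity
  case object Shop extends Activity

  val states: List[Weather] = List(Sunny, Rainy)

  // Hypothetical parameters, chosen only for illustration.
  val initial: Map[Weather, Double] = Map[Weather, Double](Sunny -> 0.6, Rainy -> 0.4)

  // P(next day's weather | previous day's weather)
  def transition(prev: Weather, next: Weather): Double = (prev, next) match {
    case (Sunny, Sunny) => 0.7
    case (Sunny, Rainy) => 0.3
    case (Rainy, Sunny) => 0.4
    case (Rainy, Rainy) => 0.6
  }

  // P(activity | weather on the same day)
  def emission(w: Weather, a: Activity): Double = (w, a) match {
    case (Sunny, Walk) => 0.8
    case (Sunny, Shop) => 0.2
    case (Rainy, Walk) => 0.3
    case (Rainy, Shop) => 0.7
  }

  // Condition a belief over one day's weather on that day's observed activity (Bayes' rule).
  def observe(belief: Map[Weather, Double], obs: Activity): Map[Weather, Double] = {
    val unnormalized = states.map(w => w -> belief(w) * emission(w, obs)).toMap
    val z = unnormalized.values.sum
    unnormalized.map { case (w, p) => w -> p / z }
  }

  // Push a belief one day forward through the transition model.
  def predict(belief: Map[Weather, Double]): Map[Weather, Double] =
    states.map(next => next -> states.map(prev => belief(prev) * transition(prev, next)).sum).toMap

  def main(args: Array[String]): Unit = {
    println("P(day 2 = Sunny), no evidence: " + predict(initial)(Sunny))
    println("P(day 2 = Sunny), after observing Walk on day 1: " + predict(observe(initial, Walk))(Sunny))
  }
}
```

With these numbers, observing Walk on day 1 raises the probability of a sunny day 2 from about 0.58 to about 0.64, because Walk is more likely under Sunny and Sunny tends to persist.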

Further digging led to distributed belief propagation on Spark: https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/examples/BeliefPropagation.scala and https://github.com/HewlettPackard/sandpiper

The project contains an implementation of Loopy Belief Propagation, a popular message passing algorithm for performing inference in probabilistic graphical models. It provides exact inference for graphical models without loops. While inference for graphical models with loops is approximate, in practice it is shown to work well. Our implementation is generic and operates on factor graph representation of graphical models. It handles factors of any order, and variable domains of any size. In addition, we provide specialized implementation for pairwise factors. The algorithm is implemented with Apache Spark GraphX, and thus can scale to large scale models . Further, it supports computations in log scale for numerical stability.
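
The remark about log scale is worth a small illustration. Messages in belief propagation are products of many probabilities, which quickly underflow in linear scale. The sketch below shows the usual log-sum-exp trick for normalizing a message in log scale. It is a generic illustration with made-up numbers and assumes a non-empty message; sandpiper's own code may do this differently.

```scala
object LogSpace {
  // log(sum_i exp(x_i)), computed stably by factoring out the maximum. Assumes xs is non-empty.
  def logSumExp(xs: Seq[Double]): Double = {
    val m = xs.max
    if (m.isInfinite) m else m + math.log(xs.map(x => math.exp(x - m)).sum)
  }

  // Normalize a log-scale message so that its exponentiated entries sum to 1.
  def logNormalize(logMsg: Seq[Double]): Seq[Double] = {
    val z = logSumExp(logMsg)
    logMsg.map(_ - z)
  }

  def main(args: Array[String]): Unit = {
    // Unnormalized log-weights whose linear-scale values are too small for a Double.
    val logMsg = Seq(-1000.0, -1001.0)
    val naive = logMsg.map(math.exp) // both entries underflow to 0.0, so the ratio is lost
    val stable = logNormalize(logMsg).map(math.exp)
    println(s"naive = $naive, stable = $stable")
  }
}
```

In log scale, multiplying factor entries becomes addition, and only the normalization step needs the trick above.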

In my mind this is a great use case for how Figaro would fit with Spark, and it gives a good design/architecture decomposition of what it is made of, in order to reason about what needs to change to make it distributed.

@apfeffer, thoughts please? Can we work with something like this to drive an approach?

SemanticBeeng commented 6 years ago

Found deep applications of functional programming to probabilistic programming

Effects in Bayesian Inference (video) https://www.youtube.com/watch?v=erGWMzzSUCg&list=PLnqUlCo055hX6SsmMr1AmW6quMjvdMPvK&index=7

(video bookmark) https://youtu.be/erGWMzzSUCg?list=PLnqUlCo055hX6SsmMr1AmW6quMjvdMPvK&t=1103 applause for the use of function composition

(video bookmark) https://youtu.be/erGWMzzSUCg?list=PLnqUlCo055hX6SsmMr1AmW6quMjvdMPvK&t=1626 key insights about composing MCMC algorithms: sequential Monte Carlo (SMC) handler + MH handler => particle MCMC handler (!)

[Daniel Huang] Compiling Markov Chain Monte Carlo Algorithms for Probabilistic Modeling https://danehuang.github.io/papers/augurv2.pdf

video bookmark: https://youtu.be/qrpGX-ZaP6w?t=250

[Daniel Huang] An application of computable distributions to the semantics of probabilistic programming languages https://danehuang.github.io/papers/compsem.pdf

[Daniel Huang] On Programming Languages for Probabilistic Modeling https://danehuang.github.io/papers/dissertation.pdf

Thoughts on reuse in Figaro?

apfeffer commented 6 years ago

Thanks, SemanticBeeng, those are interesting links. The work by Dan Huang looks very interesting. I know his adviser Greg Morrisett quite well. Since he’s local, I might be able to hook up with him.

Also, I’ve started working on a PP language written in Haskell that might provide fertile ground for some of these ideas before incorporating into Figaro.

Avi

SemanticBeeng commented 6 years ago

Glad you like Dan's work.

"PP language written in Haskell" sounds like an excellent approach (new start?). If functional programming is used right from the core (inferences) then it should be easier to run distributed applications with it.

The world would be a better place if all previous work on #ProbabilisticProgramming in Haskell were considered and, where deemed useful, leveraged, and if that decision-making were publicly visible.

Is that public work?

Will you make it a distributed language / architecture?

This project seems to have solved some of the distributed functional programming challenges: https://github.com/transient-haskell/transient