Closed tcfuji closed 8 years ago
Hi Ted,
This would definitely be interesting to us, but we currently have no resources or expertise to do this. Is this something you would be interested in? I’d love to talk to you if you have ideas.
Thanks,
Avi
From: Ted [mailto:notifications@github.com] Sent: Wednesday, December 31, 2014 1:56 AM To: p2t2/figaro Subject: [figaro] Figaro on Spark (#347)
Hi,
Would the Figaro project benefit from a scalable, distributed MCMC implementation? Spark seems to be a natural candidate for this.
This kind of work has been done using PySpark on pymc: http://blog.cloudera.com/blog/2014/08/bayesian-machine-learning-on-apache-spark/
We do expect to be looking at this in a couple of months as part of a new project that will be launching at that time. I also would be interested in hearing your ideas about this.
@apfeffer Sure! Still learning Figaro but I can help with the Spark side of things.
@motownblue At this point, my vague idea on how to proceed first would be to create an RDD-like Element class. RDDs are the main data structure in Spark and combining it with Figaro's Element seems to be essential.
I have some experience with Spark and could probably help out on this.
However, given the close relationship between Spark and distributed graph frameworks (i.e., GraphX), I think a real natural integration would be to get a distributed Belief Propagation algorithm running on top of Spark (possibly using GraphX components).
Brian
Is this to be able to use resources more efficiently or to parallelize the computations?
If the latter, remember that an MCMC sample depends on calculations based on the previous sample, so you still have to do the sampling “in sequence”, and the acceptance decision has to compare and select between the results of two samples. You don’t gain much from parallelization in this scenario since, unlike straight MC methods, the samples are not i.i.d.
What you can gain is the ability to distribute the computation to more powerful machines and to save on local storage during the computations. Note that this does not necessarily improve processing speed if the distributed platform is no more powerful than the local one. It may even slow things down as network traffic comes into play.
One thing to think about for all our algorithms is whether the inference calculations can be sequenced so that parts can be implemented in parallel. A simple example would be something like A depends on (B, C), where B and C do not depend on A (i.e., no loop). Then B and C could be calculated independently and applied to A when both calculations are done. This is a standard use of Futures. Spark, Akka, and other tools would help speed up processing in this instance, but this requires a lot more analysis on our part.
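Glenn's A-depends-on-(B, C) example can be sketched with plain Scala Futures. This is only an illustrative sketch: `computeB` and `computeC` are hypothetical stand-ins for independent inference sub-computations, not Figaro code.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-ins for two expensive, mutually independent calculations.
def computeB(): Future[Double] = Future(0.3)
def computeC(): Future[Double] = Future(0.6)

def computeA(): Future[Double] = {
  val fb = computeB() // both futures are started here,
  val fc = computeC() // so B and C run concurrently
  for { b <- fb; c <- fc } yield b * c // combine once both are done
}

// Blocking only for the demo; real code would keep composing Futures.
val a = Await.result(computeA(), 5.seconds)
```

The key point is that the futures are started before the for-comprehension; putting the calls inside it would serialize them.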
Glenn
Glenn Takata Charles River Analytics Inc. 617.491.3474 x625 www.cra.com
It might be worth having a look at emcee (Python), which seems to be capable of multithreading: http://dan.iel.fm/emcee/current/user/advanced/#multiprocessing
With MCMC, you can still run N parallel versions of the chain and compile the samples at the end.
The real impediment to getting parallel MCMC to run on Figaro is that Figaro models are not stateless. That is, we persist the values of each element in a model inside the model class. This presents a big problem for parallel MCMC (and importance sampling, for that matter). You have to either run N MCMC algorithms on N copies of your model, or remove the state from the elements and have MCMC maintain N states of the model. Both are valid options but aren’t trivial to implement.
This is also why I suggested BP as a good vehicle for parallel computation.
Brian
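Brian's first option ("run N MCMC algorithms on N copies of your model") can be sketched in plain Scala: each chain owns its own state, chains run concurrently via Futures, and the samples are pooled at the end. The random-walk Metropolis chain targeting a standard normal is a made-up stand-in for a Figaro model.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Random

// One chain: random-walk Metropolis targeting a standard normal.
// All mutable state (rng, x) is local to the chain, so nothing is shared.
def runChain(seed: Long, steps: Int): Vector[Double] = {
  val rng = new Random(seed)
  var x = 0.0
  Vector.fill(steps) {
    val proposal = x + rng.nextGaussian()
    // log acceptance ratio for the target density exp(-x^2 / 2)
    if (math.log(rng.nextDouble()) < (x * x - proposal * proposal) / 2) x = proposal
    x
  }
}

// Run N chains concurrently and compile the samples at the end.
def pooledSamples(nChains: Int, steps: Int): Vector[Double] = {
  val chains = (1 to nChains).map(i => Future(runChain(i.toLong, steps)))
  Await.result(Future.sequence(chains), 1.minute).flatten.toVector
}
```

This sidesteps shared state entirely, which is exactly why a stateful model class gets in the way: in Figaro, the per-chain state would live inside the shared elements instead of inside `runChain`.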
We have already discussed removing the state from Figaro elements and replacing them with views. We had first considered it as a method of improving the code base and making it more functional, but if there’s a real need for it perhaps now is the time to do it.
Isn’t the point of these Big Data technologies that you can have each server maintain its own state? If so, each server could run its own version of the model and avoid the statefulness problem (assuming that they really are parallel).
@tcfuji As far as I am concerned, the first step, before you even start with RDDs, has to be a serializable Element implementation.
Possibly subsumed by other projects
@apfeffer - assuming you closed this because you classified it as "future work". Do you see a case for a review of the implementation (towards a more functional one) so that the road is paved for porting to Spark?
There are a few other interesting things that could be leveraged: https://devblogs.nvidia.com/parallelforall/gpus-accelerate-epidemic-forecasting/
“Without CUDA technology, the MCMC is simply too slow to be of practical use during a disease outbreak,” he says. “With the 380x speedup over a single core non-vector CPU code, real-time forecasting is now a reality!”
I'm still very interested in seeing Figaro on Spark. Unfortunately we don't have the resources to do this right now, but I would still like to see it happen somehow.
Avi
++++++++++++++++++++++ Dr. Avi Pfeffer Chief Scientist Charles River Analytics, Inc. (617) 491-3474x513 apfeffer@cra.com www.cra.com
If it is at all possible to start a public project to migrate Figaro first to a more functional approach, it would open the road for others.
I looked at this before, but its being stateful was a roadblock, in the sense that it would be a big effort and hard to tell what you would and would not welcome in this rewrite.
Maybe you could start with your requirements and motivation for this distributed implementation?
I'm sorry about the slow response. We've had the goal of making Figaro stateless for a long time. However, it's not trivial to implement. Unfortunately, it's always fallen behind developing new functionality. Right now, we really don't have resources to do it ourselves. We'd welcome a public effort to make Figaro functional.
We're hoping to release Figaro 5.0 in August or so, after which would be a good time to take on a fundamental task like this.
A question I have for you: Do you believe a stateless implementation would be necessary for implementing Figaro on Spark? If so, please explain why.
Thanks,
Avi
Do you believe a stateless implementation would be necessary for implementing Figaro on Spark?

Not a stateless one (that may even be impossible), but a functional one, in the sense that the mutable state is managed visibly/explicitly, as opposed to implicitly as in OO, and the state mutations are isolated into explicit effects.

The intuitive motivation for applying this powerful technique from functional programming is to ease distributed computation when running in a cluster in the large (Spark), and even in the small (multi-GPU). When crossing JVM / worker / GPU boundaries, such an implementation may be necessary so that the computation and its effects can be used without having to become bound to a location.

Whether this intuition applies to Figaro and its use cases would have to be validated by writing thorough specs for how we would want to use it in Spark.

The authority on how to use effects and how they are implemented in Scala is @etorreborre. See more about this in https://github.com/atnos-org/eff.

If this makes enough sense, when Figaro 5.0 is out maybe we can write these specs for how we would see it used in Spark and re-evaluate?
There is previous work on probabilistic programming with monads. I would probably start there.
Of course you would reference a Haskell paper, @etorreborre: the proper and best way to start. :+1: Does https://github.com/atnos-org/eff have enough support to port this Haskell code to Scala?
The paper mentions MCMC, which is used in Figaro.
I will re-mention here what I said earlier because it makes the case for using GPUs https://devblogs.nvidia.com/parallelforall/gpus-accelerate-epidemic-forecasting/
“Without CUDA technology, the MCMC is simply too slow to be of practical use during a disease outbreak,” he says. “With the 380x speedup over a single core non-vector CPU code, real-time forecasting is now a reality!”
@SemanticBeeng I have no clue because I never did probabilistic programming but it might be possible yes.
This is very interesting.
Let me describe the ways Figaro is currently stateful and then we can think about what to do about it.
There are two main stateful classes: Element and Universe. An Element represents a random variable; its state consists of (1) any conditions and constraints that have been added to the element; (2) its current value under a sampling process. I believe that #2 should really be moved out of Element and made a view that is part of an inference algorithm that uses those values, rather than an intrinsic property of the element. #1 does not seem as worrisome, as this is "static" and should not change at all during inference.
Universe is where I think the difficulty lies. A Universe contains a set of elements as well as information about the dependencies between the elements. During execution, elements get added to and removed from a universe. I'm having a hard time imagining how this could be done in a functional way, but this is probably my lack of imagination and experience with functional programming.
I'd appreciate any thoughts you have.
Thanks,
Avi
Avi, why couldn’t you have a view on Universes? The “value” of a universe is really just the set of elements and dependencies that exists in the universe. Not saying it would be easy to implement, but conceptually it could work.
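One way to picture that "view": treat Universe itself as an immutable value whose add/remove operations return a new Universe, so its "value" at any point is just the element set plus the dependency map. `Element` and the fields below are illustrative stand-ins, not Figaro's actual classes.

```scala
// Illustrative stand-in for Figaro's Element.
final case class Element(name: String)

// An immutable Universe: add/remove/dependsOn return a new Universe
// instead of mutating shared state.
final case class Universe(
    elements: Set[Element],
    dependencies: Map[Element, Set[Element]]) {

  def add(e: Element): Universe =
    copy(elements = elements + e)

  def dependsOn(e: Element, parent: Element): Universe =
    copy(dependencies =
      dependencies.updated(e, dependencies.getOrElse(e, Set.empty) + parent))

  def remove(e: Element): Universe =
    Universe(elements - e, (dependencies - e).view.mapValues(_ - e).toMap)
}

object Universe { val empty: Universe = Universe(Set.empty, Map.empty) }
```

Elements being added to and removed from a universe during execution then becomes a sequence of Universe values, which is the kind of thing a State monad can thread through a computation.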
Hi @apfeffer, the "classical" FP way to deal with `Universe` is to make it immutable and pass it from function to function, using it as the context of a State monad. So you can write something like this (sketched here with `cats.data.State`; `universe.setParameter` is just an illustrative method, not a real Figaro API):

```scala
import cats.data.State

def myFunctionWithUniverse(p1: Int, p2: Int): State[Universe, Int] =
  for {
    // get the current universe if you need it
    universe <- State.get[Universe]
    // modify the current universe
    _ <- State.set(universe.setParameter(p1 + p2))
  } yield p1 + p2

// this function uses a stateful function but doesn't access the Universe directly
def otherFunction: State[Universe, Int] =
  for {
    i <- myFunctionWithUniverse(1, 2)
  } yield i * i
```
This story becomes a bit more complicated when you have other "effects" (errors for example) or concurrency to thread in.
@apfeffer I am somewhat familiar with the code and agree that the Universe, this global state, needs to be decomposed somehow.

Maybe we can start with the use cases you have in mind for using Figaro in Spark (distributed) where the current implementation would be an issue.

Another direction would be to make explicit the computations that have effects mutating the Universe, giving them business names so we can reason about them. This mindset will become clearer if you can be kind enough to read about task-based UIs, event sourcing, CQRS, immutable domain models, etc.:

https://cqrs.wordpress.com/documents/task-based-ui/
https://medium.com/technology-learning/event-sourcing-and-cqrs-a-look-at-kafka-e0c1b90d17d8
https://tech.zilverline.com/2011/02/01/towards-an-immutable-domain-model-introduction-part-1
https://tech.zilverline.com/2011/02/02/towards-an-immutable-domain-model-immutable-change-part-2
https://tech.zilverline.com/2011/02/05/towards-an-immutable-domain-model-immutability-achieved-part-3
https://tech.zilverline.com/2011/02/07/towards-an-immutable-domain-model-believe-the-type-part-4
https://tech.zilverline.com/2011/02/10/towards-an-immutable-domain-model-monads-part-5

I am ready to clarify if you have questions about the material in the links above. Please be kind to skim and advise.
Just FYI, this paper just showed up on Arxiv:
https://arxiv.org/pdf/1707.02047.pdf
The authors propose a PPL on top of Spark. Seems relevant.
(a) We present the extension of Scala’s syntax that can express various sophisticated Bayesian network models with ease.
(b) We present the details of compiling and executing an InferSpark program on Spark. That includes the mechanism of automatic generating efficient inference codes that include checkpointing (to avoid long lineage), proper timing of caching and anti-caching (to improve efficiency under memory constraint), and partitioning (to avoid unnecessary replication and shuffling).
(c) We present an empirical study that shows InferSpark can enable statistical inference on both customized and standard models at scale.
Sounds very nice. Hope they publish code.
One of the key parts is how they implement MCMC. Parallelism in MCMC is hard because MCMC is inherently a serial algorithm: https://stats.stackexchange.com/questions/204326/what-makes-parallel-distributed-probabilistic-inference-difficult-to-implement
Found this also interesting: https://github.com/tensorprob/tensorprob
The posterior distribution (or likelihood function) are constructed and evaluated using TensorFlow, which means you can make use of multiple CPU cores and GPUs simultaneously. This also makes it easy to add new custom probability distributions by using the symbolic operators defined in TensorFlow.
Again the GPU thing comes up.
Found this great article studying HMMs and Viterbi with Figaro: https://mioalter.wordpress.com/2016/02/13/hmm-hidden-markov-models-with-figaro/

It shows how Figaro fits from a functional programming paradigm perspective. It also gives insights into elements of probabilistic programming, like probability distributions and belief propagation, in the context of HMMs.
```scala
// From the article: condition on an observation, then query a hidden state
hmm.observable(1).observe(Walk)
println("After observing Walk on day 1: " + VariableElimination.probability(hmm.hidden(2), Sunny))
```
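For comparison with the Figaro version, here is a hand-rolled Viterbi sketch in plain Python; the toy weather model (state names, transition and emission tables) is an illustrative assumption, not taken from the blog post.

```python
import math

# Illustrative toy HMM: hidden weather states, observed activities.
states = ["Sunny", "Rainy"]
start = {"Sunny": 0.6, "Rainy": 0.4}
trans = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},
         "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
emit = {"Sunny": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1},
        "Rainy": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5}}

def viterbi(obs):
    """Most likely hidden-state sequence, by dynamic programming in log space."""
    v = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t+1
    for o in obs[1:]:
        nv, ptr = {}, {}
        for s in states:
            prev, lp = max(((p, v[p] + math.log(trans[p][s])) for p in states),
                           key=lambda t: t[1])
            nv[s] = lp + math.log(emit[s][o])
            ptr[s] = prev
        back.append(ptr)
        v = nv
    last = max(v, key=v.get)  # best final state, then trace pointers back
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["Walk", "Walk", "Clean"]))  # -> ['Sunny', 'Sunny', 'Rainy']
```

The point of the comparison: Viterbi is a fold over observations, which is exactly the kind of structure that functional languages like Figaro's host Scala express naturally.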
Further digging led to distributed belief propagation on Spark:
https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/examples/BeliefPropagation.scala and https://github.com/HewlettPackard/sandpiper
The project contains an implementation of Loopy Belief Propagation, a popular message passing algorithm for performing inference in probabilistic graphical models. It provides exact inference for graphical models without loops. While inference for graphical models with loops is approximate, in practice it is shown to work well. Our implementation is generic and operates on a factor graph representation of graphical models. It handles factors of any order, and variable domains of any size. In addition, we provide a specialized implementation for pairwise factors. The algorithm is implemented with Apache Spark GraphX, and thus can scale to large-scale models. Further, it supports computations in log scale for numerical stability.
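A minimal sum-product sketch (plain Python, with made-up factor tables) of what such a BP implementation computes: on a small loop-free chain, the belief obtained by message passing matches exact marginalization, which is the "exact on trees" property the README mentions. With loops, the same message updates would simply be iterated to a fixed point.

```python
# Pairwise factors on a 3-variable chain A - B - C, binary domains.
# Tables are illustrative assumptions.
f_ab = {(a, b): [[0.9, 0.1], [0.2, 0.8]][a][b] for a in (0, 1) for b in (0, 1)}
f_bc = {(b, c): [[0.7, 0.3], [0.4, 0.6]][b][c] for b in (0, 1) for c in (0, 1)}

def normalize(m):
    z = sum(m)
    return [x / z for x in m]

# Factor-to-variable messages into B. The leaf variables A and C send
# uniform messages, so a single inward sweep already yields B's belief.
msg_ab_to_b = normalize([sum(f_ab[(a, b)] for a in (0, 1)) for b in (0, 1)])
msg_bc_to_b = normalize([sum(f_bc[(b, c)] for c in (0, 1)) for b in (0, 1)])
belief_b = normalize([msg_ab_to_b[b] * msg_bc_to_b[b] for b in (0, 1)])

# Brute-force check: marginalize the full joint over A and C.
exact_b = normalize([sum(f_ab[(a, b)] * f_bc[(b, c)]
                         for a in (0, 1) for c in (0, 1)) for b in (0, 1)])
```

The Spark angle is that each message depends only on a vertex's neighbors, so the updates map directly onto GraphX's aggregate-messages style of computation.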
In my mind this is a great use case for how Figaro could fit with Spark. It also gives a good design/architecture decomposition of what belief propagation is made of, which helps reason about what needs to change to make it distributed.
@apfeffer - thoughts please? can we work with something like this to drive an approach?
Found deep applications of functional programming to probabilistic programming:
Effects in Bayesian Inference (video) https://www.youtube.com/watch?v=erGWMzzSUCg&list=PLnqUlCo055hX6SsmMr1AmW6quMjvdMPvK&index=7
(video bookmark) https://youtu.be/erGWMzzSUCg?list=PLnqUlCo055hX6SsmMr1AmW6quMjvdMPvK&t=1103 applauses about use of function composition
(video bookmark) https://youtu.be/erGWMzzSUCg?list=PLnqUlCo055hX6SsmMr1AmW6quMjvdMPvK&t=1626 key insights about composing MCMC algorithms: sequential monte carlo (SMC) handler + MH handler => particle MCMC handler (!)
[Daniel Huang] Compiling Markov Chain Monte Carlo Algorithms for Probabilistic Modeling https://danehuang.github.io/papers/augurv2.pdf
video bookmark: https://youtu.be/qrpGX-ZaP6w?t=250
[Daniel Huang] An application of computable distributions to the semantics of probabilistic programming languages https://danehuang.github.io/papers/compsem.pdf
[Daniel Huang] On Programming Languages for Probabilistic Modeling https://danehuang.github.io/papers/dissertation.pdf
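The "SMC handler + MH handler => particle MCMC handler" insight from the talk can be sketched without any effect-handler machinery: treat each algorithm as a kernel (a function on states) and compose them. The target, annealing schedule, and all parameters below are made-up toy choices, not from the talk.

```python
import math
import random
import statistics

rng = random.Random(0)

def log_target(x):
    """Unnormalized log density of the toy target, N(0, 1)."""
    return -x * x / 2

def mh_kernel(x, log_p, steps=5, scale=1.0):
    """A generic MH kernel: usable standalone or plugged into SMC."""
    for _ in range(steps):
        prop = x + rng.uniform(-scale, scale)
        if rng.random() < math.exp(min(0.0, log_p(prop) - log_p(x))):
            x = prop
    return x

def smc_with_mh(n_particles=500, n_stages=10):
    """Resample-move SMC: anneal toward the target, reweighting and
    resampling each stage, then rejuvenating every particle with the
    MH kernel -- the composition of the two algorithms."""
    particles = [rng.gauss(0, 3) for _ in range(n_particles)]
    for t in range(1, n_stages + 1):
        beta_prev, beta = (t - 1) / n_stages, t / n_stages
        w = [math.exp((beta - beta_prev) * log_target(x)) for x in particles]
        particles = rng.choices(particles, weights=w, k=n_particles)
        particles = [mh_kernel(x, lambda y: beta * log_target(y))
                     for x in particles]
    return particles

ps = smc_with_mh()
ps_mean, ps_sd = statistics.mean(ps), statistics.stdev(ps)
```

Note that `mh_kernel` is reused unchanged inside `smc_with_mh`; that reusability of inference components is what the handler composition in the talk buys you in a principled way.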
Thoughts on reuse in Figaro?
Thanks, SemanticBeeng, those are interesting links. The work by Dan Huang looks very interesting. I know his adviser Greg Morrisett quite well. Since he’s local, I might be able to hook up with him.
Also, I’ve started working on a PP language written in Haskell that might provide fertile ground for some of these ideas before incorporating into Figaro.
Avi
Glad you like Dan's work.
"PP language written in Haskell" sounds like an excellent approach (new start?).
If functional programming is used right from the core (the inference algorithms), then it should be easier to run distributed applications with it.
The world would be a better place if all previous work on #ProbabilisticProgramming in Haskell were considered and leveraged where deemed useful, and if that decision making were visible publicly.
Is that work public?
Will you make it a distributed language/architecture?
This project seems to have solved some of the distributed functional programming challenges: https://github.com/transient-haskell/transient