QuantEcon / QuantEcon.jl

Julia implementation of QuantEcon routines
https://quantecon.org/quantecon-jl/
BSD 3-Clause "New" or "Revised" License

ENH: parallelize the bellman_operator (and similar) #13

Closed sglyon closed 7 years ago

sglyon commented 10 years ago

This algorithm is easily parallelizable.

As a first shot we might want to implement this using the DArray type. We can refer to this implementation of the game of life to see how it works. If that doesn't give us what we want, we could easily just distribute indices/chunks of the array, have each process update its chunk, then have the master process organize them back into the original array in preparation for the next iteration.
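As a rough illustration of the chunked alternative, here is a minimal sketch in present-day Julia syntax; bellman_update is a hypothetical stand-in for the per-grid-point maximization, not an existing QuantEcon function:

using Distributed
addprocs(4)                          # spawn 4 worker processes

@everywhere function bellman_update(v::Vector{Float64}, i::Int)
    # hypothetical stand-in for the per-grid-point maximization step
    0.9 * v[i] + 1.0
end

function parallel_bellman(v::Vector{Float64})
    # split the grid indices into one chunk per worker
    chunks = collect(Iterators.partition(1:length(v), cld(length(v), nworkers())))
    # each worker updates its chunk of the value function
    results = pmap(chunks) do idx
        [bellman_update(v, i) for i in idx]
    end
    # the master process reassembles the chunks into the full array
    reduce(vcat, results)
end

v_new = parallel_bellman(zeros(100))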

sglyon commented 9 years ago

@cc7768

jstac commented 9 years ago

Random thoughts: One issue is that there are so many different kinds of Bellman operators. Are you thinking of an "example" parallelization for one model that can be adapted to others on an individual basis? Or something more generic?

I've discussed with @oyamad the possibility of putting together a subpackage for working with finite state MCs, including some existing code plus optimization via Bellman operator iteration plus policy function iteration. If that comes together it would be intended as generic code that various models could be fitted into -- and hence an ideal place to implement parallelization and other enhancements.
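For what it's worth, the kind of generic routine being described might look something like this minimal policy function iteration sketch (present-day Julia; R, Q, beta and all other names are illustrative assumptions, not an existing QuantEcon API):

using LinearAlgebra

function policy_iteration(R, Q, beta)
    S, A = size(R)                           # R is S x A, Q is S x A x S
    sigma = ones(Int, S)                     # initial policy
    v = zeros(S)
    while true
        # policy evaluation: solve v = R_sigma + beta * Q_sigma * v
        Q_sigma = vcat((Q[s, sigma[s], :]' for s in 1:S)...)
        R_sigma = [R[s, sigma[s]] for s in 1:S]
        v = (I - beta * Q_sigma) \ R_sigma
        # policy improvement: act greedily with respect to v
        sigma_new = [argmax([R[s, a] + beta * dot(Q[s, a, :], v) for a in 1:A])
                     for s in 1:S]
        sigma_new == sigma && return v, sigma
        sigma = sigma_new
    end
end

R = rand(3, 2); Q = rand(3, 2, 3); Q ./= sum(Q, dims=3)
v, sigma = policy_iteration(R, Q, 0.95)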

sglyon commented 9 years ago

@jstac : great point.

I would love for this to be a general tool, but realize we would probably need to write the tools for some subset or class of model types. If you and @oyamad would like to push forward on the MC toolkit I would be very interested in helping out.

cc7768 commented 9 years ago

This all sounds very cool. Am interested in these topics and think providing a good framework for this within QE would be a great achievement. Let me know about the progress on the finite state MC toolkit -- I would be willing to chip in a little there (although time might be scarce until school is out).

albop commented 9 years ago

@spencerlyon2 and @cc7768 : On parallelizing dynamic programming algorithms: it is a very important topic, with a lot of ongoing research. There are many ways to do it, which are also language specific. Do you know specifically how you want to do it? To me, the way to go would be to choose a simple baseline model, then try and document a few different approaches on it. I'm very curious about it. There are many things to learn from it. It almost sounds like a good small independent project (that could also be incorporated later).

sglyon commented 9 years ago

@albop Do you have any references, or have you otherwise seen any of the "language specific" ways of doing this? I'd be curious to see where others have gone with this.

If anyone has a "good" model that can serve as a baseline, please let me know. I would think that the minimal qualities of a good baseline model are that it is relatively straightforward and easy to understand (so we can think about the algorithm more than the model), yet takes a (relatively) long time to solve via standard fixed point iteration on the Bellman operator (so we can see potential speedups).

albop commented 9 years ago

My point was that in general there are many paradigms for parallelism: SIMD, MIMD, in-process, multicore, concurrent/parallel (look at http://en.wikipedia.org/wiki/Parallel_computing). Many languages provide syntactic sugar like parfor, coroutines (or goroutines), and futures. But the kind of performance gains you can get out of them depends on a lot of factors. For instance, Python coroutines are actually executed on a single core, while the performance of the Julia model is still partly unknown and very dependent on the architecture (from what I remember from the ML). If the point is about optimizing a function by making it parallel, then it would be nice to show which approaches are actually efficient, and in which contexts (multicore, grid, etc.).

As for a basic model, you could just take the neoclassical one and increase the number of grid points and shocks dramatically. For a first pass, it doesn't really matter as long as it is well behaved.
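To make the suggested baseline concrete, here is a minimal serial sketch of the neoclassical growth model in present-day Julia syntax (log utility, full depreciation, grid-search maximization; parameter values and names are illustrative, not from QuantEcon):

alpha, beta = 0.65, 0.95
grid = collect(range(1e-6, 2.0, length=500))    # capital grid
f(k) = k^alpha                                  # production function

function bellman_operator(v)
    Tv = similar(v)
    for (i, k) in enumerate(grid)
        # grid search over feasible next-period capital k' < f(k)
        vals = [log(f(k) - kp) + beta * v[j]
                for (j, kp) in enumerate(grid) if kp < f(k)]
        Tv[i] = maximum(vals)
    end
    return Tv
end

v = zeros(length(grid))
for it in 1:50
    global v = bellman_operator(v)   # crude fixed point iteration on T
end

Increasing the grid size (and adding shocks) makes the inner maximization expensive enough that parallel speedups become visible.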

sglyon commented 9 years ago

Thanks for the feedback. Very useful.

I see what you meant about parallelism.

And we could definitely start with the growth model and move forward from there (I'm assuming by neoclassical you meant the neoclassical growth model).

albop commented 9 years ago

Yes, the neoclassical growth model. Keep me posted about your progress.

sglyon commented 9 years ago

Will do. It might be somewhat slow as classes start again this week, but I will work on it.

oyamad commented 9 years ago

Hi all, I have been working on coding a Markov Decision Process (MDP) class (while I am not totally sure if this is exactly what @jstac has in mind) to submit to QE, which has been work in progress for a while: https://github.com/oyamad/mdp

There are two tasks yet to be done.

I have been learning by teaching on this matter; you might take a look at my course repository: https://github.com/oyamad/theory14

Any suggestions would be greatly appreciated (though this, the Julia side, may not be a good place to start a discussion on this...).

jstac commented 9 years ago

For the record, I've had a look at @oyamad's code and I think it's excellent. Just what I had in mind. The notebook illustrations are also excellent. I'm constrained by time-sensitive work commitments for the next month or two, but after that I'm very keen to get it merged. And, being generic, I think it would be an ideal target for parallelization.

sglyon commented 9 years ago

@oyamad any update on the mdp code? I've just taken a look and it is indeed well written.

It reminds me of the ddpsolve function from Miranda and Fackler.

I think having a highly optimized (parallelized?) version of this would be very cool.

oyamad commented 9 years ago

@spencerlyon2 I haven't had the chance to touch the code since the previous comment of mine above...

I suspect the function from Miranda and Fackler does not allow sparse matrix inputs; one way to allow them is to adopt the "state-action pairs formulation", where the transition matrix is represented by a 2-dimensional array defined on SA x S, with SA being the set of (feasible) state-action pairs. I need to integrate the two approaches, the standard one with S x A x S and the SA x S approach, into one piece.
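To illustrate the distinction, a hypothetical sketch of the two layouts in present-day Julia (the names s_indices, a_indices, R_sa, and Q_sa are illustrative assumptions, not from the mdp code):

using SparseArrays

S, A = 3, 2                      # number of states and actions

# standard formulation: reward R is S x A, transition Q is S x A x S
R = rand(S, A)
Q = rand(S, A, S)
Q ./= sum(Q, dims=3)             # each Q[s, a, :] is a distribution over S

# state-action pairs formulation: enumerate the feasible (s, a) pairs
s_indices = [1, 1, 2, 2, 3, 3]   # state of each pair
a_indices = [1, 2, 1, 2, 1, 2]   # action of each pair
R_sa = [R[s, a] for (s, a) in zip(s_indices, a_indices)]
# Q_sa has one row per state-action pair (SA x S) and can be stored sparsely
Q_sa = sparse(vcat((Q[s, a, :]' for (s, a) in zip(s_indices, a_indices))...))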

It would be nice if someone would be willing to help; in that case, I should put the code in a new branch in QuantEcon.py.

I have no idea (and knowledge) about parallelization (in general as well as for DP). Could you suggest a pointer to any helpful reference?

nilshg commented 9 years ago

Not sure this is helpful, as I haven't gone through the exact problem you want to parallelize, but I've written a parallel version of a simple finite life-cycle model which is being solved recursively here.

The basic idea is just to have an Array{Float64,N} each for the value and policy functions, where N is the number of state variables plus one (the time dimension), and then a SharedArray{Float64,N-1} to which the optima of the Bellman equation for each combination of state variables are written. Having both these arrays allows one to easily store the solutions to the maximization problem in parallel, but do all other necessary calculations (such as the interpolation) with regular Arrays, so that one doesn't have to extend all sorts of other functions to accept SharedArrays. Basically (in pseudo code):

v = Array(Float64, state_dims..., T)         # value function, time last
w_prime = Array(Float64, state_dims..., T)   # policy function
v_opt = SharedArray(Float64, state_dims...)  # one period's optima, shared
wp_opt = SharedArray(Float64, state_dims...) # across the worker processes

for t = T:-1:1
  # interpolate next period's value function (at t = T, use the
  # terminal condition instead of v[:, :, ..., t+1])
  v_prime = interpolateV(v[:, :, ..., t+1], grids)
  # @parallel takes a single loop, so parallelize one state dimension
  # and loop over the remaining ones inside
  @sync @parallel for x = 1:xpoints
    for y = 1:ypoints  # ...one nested loop per remaining state variable
      (v_opt[x, y, ...], wp_opt[x, y, ...]) = solveBellman(x, y, ..., v_prime)
    end
  end
  v[:, :, ..., t] = sdata(v_opt)        # copy the shared results back
  w_prime[:, :, ..., t] = sdata(wp_opt) # into the regular Arrays
end

I believe this is a reasonably simple and general structure that can be applied to most existing code without many changes. If I run this on my local machine with 4 physical cores, running it with just one core indeed takes three times as long as running it with 3 worker cores, so the speedup seems proportional to the number of cores used.
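A quick hypothetical harness for checking that scaling on one's own machine (heavy is a made-up stand-in for an expensive Bellman maximization):

using Distributed
addprocs(3)                            # 3 workers, as in the comparison above
@everywhere heavy(i) = sum(sin(i * j) for j in 1:10^6)

@time map(heavy, 1:24)                 # serial baseline on the master
@time pmap(heavy, 1:24)                # should be roughly 3x faster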

The obvious limitation is that SharedArrays won't work with clusters, so this is best for medium-sized problems which can be run on a local server.

sglyon commented 9 years ago

Hey @nilshg this is great! Thanks for sharing.

Do you have a working version of this algorithm/implementation somewhere publicly visible?

nilshg commented 9 years ago

Yes, as part of my thesis I'm replicating Guvenen (AER, 2007), and my implementation is in the LearningModels repository.

It's not quite there in terms of actually replicating the results, but it's a working version in the sense that if you unzip the repo, open the NHL/NHL_0_JuliaCode.jl file, adjust the path variable to where you've saved the repo, and run it, it should solve the recursive consumer problem (baseline 40 periods with 6,600 points in the state space per period) in about 3-4 minutes.

[NB: this is written in 0.4 syntax, so you won't be able to run it in 0.3.11!]

jstac commented 9 years ago

I think these are exactly the kind of parallelization routines we want to start adding to QuantEcon, with shared memory rather than clusters: routines that can take advantage of multiple cores and up-spec machines without having to put a lot of effort into custom code. On the Python side, the discrete DP code that @oyamad has put together seems like a natural target.