MaximilianJHuber opened 8 years ago
Profiling the `solve` function suggests nothing unusual.
BUT: lines 319 and 380 in ddp.jl have:
vals = ddp.R + ddp.beta * ddp.Q * v
but really should be:
vals = ddp.R + ddp.beta * (ddp.Q * v)
Runtime drops by a factor of 6, and memory allocations drop from 500 MB to 2 MB, from this single change!
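The gap comes from evaluation order: `*` associates left to right, so `ddp.beta * ddp.Q * v` first scales the entire `Q` matrix by the scalar, allocating a full matrix-sized temporary, before the matrix-vector product; the parenthesized version collapses `Q * v` to a vector first. The same effect can be sketched in Python/NumPy (a stand-in for the Julia code; the sizes here are made up):

```python
import numpy as np

n = 2_000
beta = 0.95
rng = np.random.default_rng(0)
Q = rng.random((n, n))
Q /= Q.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
v = rng.random(n)

# Left-to-right: (beta * Q) materializes a full n-by-n temporary
# before the matrix-vector product is taken.
vals_slow = beta * Q @ v

# Parenthesized: Q @ v collapses to an n-vector first, so the scalar
# multiply only touches O(n) memory instead of O(n^2).
vals_fast = beta * (Q @ v)

assert np.allclose(vals_slow, vals_fast)
```

Both expressions are mathematically identical; only the size of the intermediate differs, which is exactly what the allocation numbers above reflect.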
Great sleuthing! This makes perfect sense!
Do you mind putting that change in a pull request?
I did not know what that meant 30 minutes ago, but I hope I did the right thing!
A new performance test with the altered matrix multiplication yielded:
constructor 0.773396 seconds (2.02 k allocations: 466.133 MB, 2.01% gc time)
solve 1.922052 seconds (616 allocations: 108.402 MB, 1.30% gc time)
While Python has:
constructor CPU times: user 103 ms, sys: 8.35 ms, total: 112 ms
solve CPU times: user 56.5 ms, sys: 2.05 ms, total: 58.5 ms
So the issue remains open.
Further profiling points to linear algebra as the most expensive type of operation, for both PFI and VFI:
@oyamad I saw you are working on a number of memory/performance issues concerning ddp.jl. May I suggest making Q sparse throughout ddp.jl?
I played around with the `evaluate_policy` function (ddp.jl:451), and making Q sparse (see the task above) leads to a ridiculously high performance gain (a factor of 22000), which I guess will outweigh the other disadvantages of sparse matrices. Making Q sparse in the `bellman_operator!` function (ddp.jl:318) yields a performance boost of a factor of 25.
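For context, policy evaluation solves the linear system v_σ = R_σ + β Q_σ v_σ, i.e. (I − β Q_σ) v_σ = R_σ, and a sparse direct solve of that system is where a sparse Q_σ pays off. A rough sketch of the dense-vs-sparse gap using SciPy as a Python stand-in (size and density are made up for illustration):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, beta = 1_000, 0.95

# Sparse row-stochastic Q_sigma: each state reaches only a few successors.
Q = sp.random(n, n, density=0.005, format="csr", random_state=0)
Q = Q + sp.eye(n, format="csr") * 1e-3              # avoid all-zero rows
Q = sp.csr_matrix(Q.multiply(1.0 / Q.sum(axis=1)))  # normalize rows
R = np.random.default_rng(0).random(n)

# Dense solve of (I - beta * Q) v = R: O(n^3) and a full n-by-n copy.
v_dense = np.linalg.solve(np.eye(n) - beta * Q.toarray(), R)

# Sparse direct solve exploits the nonzero structure.
v_sparse = spla.spsolve(sp.eye(n, format="csc") - beta * Q.tocsc(), R)

assert np.allclose(v_dense, v_sparse)
```

The sparser Q_σ is, the more lopsided the timing gets, which is consistent with the factors reported above.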
You will probably want to be careful about making Q sparse without first having some kind of check as to whether a sparse matrix is appropriate. I won't claim to know much about sparse matrices but I imagine this is easy enough to check.
There are many economic applications where Q will not be a sparse matrix. E.g., the Markov chain has two states, employment and unemployment or recession and boom, and Q is the transition matrix between these, as in e.g. Imrohoroglu (1989), or the 4x4 Q matrix on labour efficiency units used by Castañeda, Díaz-Giménez & Ríos-Rull (2003) to match the US Lorenz curves for income and wealth inequality.
Equally, in many examples Q is probably going to come from the Tauchen method of discretizing an AR(1), and those matrices will be sparse (or close enough), so there are lots of examples where the speed boost @MaximilianJHuber describes might be gained.
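One way to act on the "check whether a sparse matrix is appropriate" suggestion is to branch on the density of Q. A minimal sketch, where the helper name `maybe_sparse` and the 10% cutoff are made up for illustration (the real break-even density depends on the machine and the operations):

```python
import numpy as np
import scipy.sparse as sp

def maybe_sparse(Q, threshold=0.1):
    """Return a CSR copy of Q if it is sparse enough, else Q unchanged.

    `threshold` is an illustrative cutoff, not a tuned value.
    """
    density = np.count_nonzero(Q) / Q.size
    return sp.csr_matrix(Q) if density < threshold else Q

# A Tauchen-style AR(1) discretization concentrates mass near the
# diagonal, so most entries are (near) zero -- density ~1.5% here:
Q_banded = (np.eye(200) * 0.5
            + np.eye(200, k=1) * 0.25
            + np.eye(200, k=-1) * 0.25)
assert sp.issparse(maybe_sparse(Q_banded))

# A 2-state employment/unemployment chain is fully dense and stays so:
Q_small = np.array([[0.9, 0.1], [0.3, 0.7]])
assert not sp.issparse(maybe_sparse(Q_small))
```

This keeps the dense path for the small, fully-connected chains mentioned above while letting Tauchen-style matrices take the fast route.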
Ok, that is true. The user must decide.
I gave it a try and altered line 51 in ddp.jl from:
Q::Array{T,NQ}
to:
Q::AbstractArray{T,NQ}
(I know that is quick and dirty, because composite types should have concrete types as fields!)
and line 87 from:
new(R, full(Q), beta, _a_indices, a_indptr)
to:
new(R, Q, beta, _a_indices, a_indptr)
Then, for grid_size=500, the constructor allocates only 12 MB instead of 480 MB, with a runtime improvement by a factor of 20, and solve allocates 43 MB instead of 108 MB, with a runtime improvement by a factor of 17.
Are there expert opinions on sparse matrices? Or is the first order of business to redesign ddp.jl in order to gain performance?
I guess that once we decide on the internal data structure (visit #124 to join the discussion), it should be straightforward to implement methods that preserve the sparsity of the input matrix.
DiscreteDP performance might be bad in the Julia version.
For example, this Python code runs 220 times faster (local 2×2.66 GHz, 4 GB machine) than this Julia code. (The task is described here.)
Could someone please replicate this performance issue?