mohamed82008 opened this issue 5 years ago:
Hi Chris!
Thanks for this cool package. I wonder what the intended scope of this package is, and how it compares to something like Turing.jl. Also if you have some performance improvement ideas for Turing.jl, I would be interested to hear them. Note that there is a lot of interest now in the Turing team to improve its performance beyond simple type stability and Julia gotchas and your work here seems quite relevant. Turing is now very close to being fully type stable and Chris Rackauckas already reports a 15x speedup in his DiffEqBayes which uses Turing.
Eliminating all unnecessary allocations and enabling the full use of static arrays is also something on my radar, but I am curious if there are other tricks that you suggest. I see you rely a lot on PaddedMatrices.jl but how much of this do you think should be on the Bayesian inference side, versus the user-side by their choice of input array types that dispatch to the appropriate methods? Finally, would you be interested in joining forces?
Hi Mohamed!
I just added a README, where I briefly go into how the library works.
Turing is now very close to being fully type stable and Chris Rackauckas already reports a 15x speedup in his DiffEqBayes which uses Turing.
Cool! I remember once seeing benchmarks suggesting Turing was 10x slower than Stan. Is its performance now rather competitive?
Eliminating all unnecessary allocations and enabling the full use of static arrays is also something on my radar, but I am curious if there are other tricks that you suggest. I see you rely a lot on PaddedMatrices.jl
PaddedMatrices are much faster than StaticArrays, whether or not they actually have their namesake padding. Explicitly avoiding padding (the type parameters are nrows, ncols, element type, stride; if stride == nrows, then there is no padding):
julia> using PaddedMatrices, StaticArrays, BenchmarkTools, LinearAlgebra, Random
julia> pA = MutableFixedSizePaddedMatrix{7,7,Float64,7}(undef); randn!(pA); #to avoid padding
julia> pB = MutableFixedSizePaddedMatrix{7,7,Float64,7}(undef); randn!(pB); #to avoid padding
julia> pC = MutableFixedSizePaddedMatrix{7,7,Float64,7}(undef);
julia> sA = SMatrix{7,7}(pA);
julia> sB = SMatrix{7,7}(pB);
julia> @btime $sA * $sB
67.005 ns (0 allocations: 0 bytes)
7×7 SArray{Tuple{7,7},Float64,2,49} with indices SOneTo(7)×SOneTo(7):
2.96419 4.83028 -2.11728 1.00643 -5.69249 4.66337 2.24875
-0.446299 -1.35403 0.00484316 -0.323731 -0.155002 -1.72822 2.07241
1.43737 -0.531415 -3.50058 -0.626386 -3.66252 3.2198 0.504351
1.35284 -1.06179 -2.70543 -2.05564 -2.45683 2.49386 0.958187
1.14669 -1.48968 -3.22534 -1.49174 -2.56398 2.87688 -0.840148
1.33924 2.57827 1.07435 7.49071 6.52194 1.27062 -0.253103
-2.73455 -0.357343 3.97818 -4.22651 -0.354159 -4.7175 -1.29675
julia> @btime mul!($pC, $pA, $pB); pC
17.997 ns (0 allocations: 0 bytes)
7×7 MutableFixedSizePaddedArray{Tuple{7,7},Float64,2,7,49}:
2.96419 4.83028 -2.11728 1.00643 -5.69249 4.66337 2.24875
-0.446299 -1.35403 0.00484316 -0.323731 -0.155002 -1.72822 2.07241
1.43737 -0.531415 -3.50058 -0.626386 -3.66252 3.2198 0.504351
1.35284 -1.06179 -2.70543 -2.05564 -2.45683 2.49386 0.958187
1.14669 -1.48968 -3.22534 -1.49174 -2.56398 2.87688 -0.840148
1.33924 2.57827 1.07435 7.49071 6.52194 1.27062 -0.253103
-2.73455 -0.357343 3.97818 -4.22651 -0.354159 -4.7175 -1.29675
julia> pA = MutableFixedSizePaddedMatrix{23,23,Float64,23}(undef); randn!(pA); #to avoid padding
julia> pB = MutableFixedSizePaddedMatrix{23,23,Float64,23}(undef); randn!(pB); #to avoid padding
julia> pC = MutableFixedSizePaddedMatrix{23,23,Float64,23}(undef);
julia> sA = SMatrix{23,23}(pA);
julia> sB = SMatrix{23,23}(pB);
julia> @btime $sA * $sB
1.404 μs (0 allocations: 0 bytes)
23×23 SArray{Tuple{23,23},Float64,2,529} with indices SOneTo(23)×SOneTo(23):
0.676137 -11.5472 -4.29015 -5.12626 -6.57682 7.23703 9.99614 1.76468 1.25541 … -2.59669 -5.0571 -0.214485 -0.683722 -1.23128 1.8082 -1.67359 7.54335
-4.13968 -5.58828 3.71343 -13.4674 4.54442 -1.58558 -3.78945 -0.200756 3.10997 5.18093 -5.25095 -5.6478 3.95834 -6.12274 -0.548574 -3.54803 -0.856037
-11.2188 -12.0339 -6.31261 5.65743 1.7753 -6.29977 -0.461274 10.0776 -5.18723 3.58942 -6.23296 -0.321895 -3.41774 12.1922 6.50792 -5.91061 2.43976
-0.184734 -5.50193 2.33278 -8.37733 1.12845 0.951583 -4.23231 -2.21098 -2.55599 6.56053 -4.02659 -10.9366 2.20055 -8.96011 0.196862 -4.49699 -0.793005
3.12704 2.14365 5.85391 3.2093 1.50542 -5.46851 1.81043 -3.89439 2.17621 1.07769 7.80411 13.9072 8.97308 10.7278 -7.19291 -2.52473 -6.58659
2.44971 0.73773 -1.79319 10.0336 -6.22129 0.31953 0.783538 -6.55799 1.31856 … -3.41797 -1.6813 5.77539 -3.08049 1.66235 1.25254 7.00309 2.6951
-11.154 6.43822 2.14318 5.81669 2.03786 1.15025 -2.11491 4.13416 8.00949 -5.1131 3.3923 0.778615 0.679363 1.86173 1.83339 1.46648 6.09545
5.98668 5.10285 5.45983 -1.18721 -3.70755 7.0925 5.103 -2.27621 3.84499 -0.781866 -3.23418 -2.3769 -2.37448 -3.30754 0.933589 -2.75913 0.364206
-13.4849 -2.86356 3.653 3.56056 0.322314 1.27395 1.4939 -1.7558 9.98444 3.24293 -15.4737 -9.07452 13.5957 -2.15837 3.18083 0.976493 4.45786
-1.0082 -3.38966 -4.02155 -5.5053 -5.37906 -0.376008 -0.725058 -3.56403 -1.8769 1.8387 -2.37666 -5.56774 -1.97787 0.797312 -4.13205 -0.787578 -1.83671
1.75811 8.66197 7.20415 -1.34933 -5.1398 -1.08476 -0.984257 -7.62245 6.51987 … -6.01005 10.8171 1.81885 11.1168 -3.42523 -4.10206 1.99251 -3.41132
0.112358 -6.47499 0.316 0.831317 -8.86483 4.47975 7.09558 -7.69581 5.89953 1.59308 -11.7264 -0.805044 3.32148 -1.31097 3.71042 2.3723 1.2146
-3.47975 -10.0731 2.89155 0.404549 1.46972 -4.84939 7.18921 5.47791 2.59437 -0.9747 -5.06334 3.17199 1.14994 0.0592599 10.4362 4.56883 -3.82712
-6.85618 -7.40577 -6.59401 5.38784 -5.41918 2.10527 2.45604 -0.0598362 0.916278 -0.824134 -4.72543 4.11253 -3.92668 6.14167 4.72191 0.0359086 -0.280309
1.18071 -4.55732 -4.25194 3.94757 6.57966 -3.3513 1.16814 -1.839 -7.05042 7.78813 -3.62933 -5.2294 -0.56637 -5.63994 0.243361 -0.88296 -0.852351
-5.1927 1.47331 1.0722 2.34575 -0.763959 -3.91154 2.47275 -1.25566 5.29109 … -3.73648 -3.51496 -1.11513 0.0217914 2.44424 -4.29189 1.04555 -0.91635
2.23114 -3.24984 3.38598 -6.09565 3.0708 -3.3247 0.311149 4.43763 -2.23465 0.223676 1.95088 4.8373 -3.34843 -3.04138 -1.6914 7.67875 1.10271
-5.53478 -4.13183 -3.21564 -0.276751 -0.600247 -1.34508 4.92285 -0.811928 3.37869 1.01024 -2.3416 -1.61964 2.57656 4.31843 1.71025 4.76718 -1.24788
5.00821 -11.9721 -9.81844 -0.501608 2.79921 -1.13599 7.97383 1.7627 -1.22191 -0.620024 5.66534 2.76723 6.48373 1.60909 -5.19443 -0.565687 6.81647
-3.93591 5.5759 2.85072 0.615663 2.40643 10.4825 4.17097 -0.922532 11.3389 -2.8813 -3.35798 2.70551 1.72967 3.04221 9.47624 -2.58001 10.8072
-4.28853 2.70421 -3.46872 -3.06659 6.95763 -6.55236 0.0519248 6.77244 -1.61432 … -5.24371 10.8945 15.6312 5.20904 11.029 1.48784 1.90943 -4.33497
0.603864 -8.96422 8.12744 -2.28824 -7.73699 1.27218 7.00939 -2.23641 7.92207 -2.33262 -2.84921 -0.131382 2.8406 -4.33235 -3.18134 0.814096 5.58209
-0.471804 -0.644547 -0.0565805 -5.42696 1.46038 -9.56554 -5.56412 -7.40805 -4.75961 6.8637 1.74527 -7.12773 6.29815 0.736255 -15.2387 -2.97371 -10.2457
julia> @btime mul!($pC, $pA, $pB); pC
273.590 ns (0 allocations: 0 bytes)
23×23 MutableFixedSizePaddedArray{Tuple{23,23},Float64,2,23,529}:
0.676137 -11.5472 -4.29015 -5.12626 -6.57682 7.23703 9.99614 1.76468 1.25541 … -2.59669 -5.0571 -0.214485 -0.683722 -1.23128 1.8082 -1.67359 7.54335
-4.13968 -5.58828 3.71343 -13.4674 4.54442 -1.58558 -3.78945 -0.200756 3.10997 5.18093 -5.25095 -5.6478 3.95834 -6.12274 -0.548574 -3.54803 -0.856037
-11.2188 -12.0339 -6.31261 5.65743 1.7753 -6.29977 -0.461274 10.0776 -5.18723 3.58942 -6.23296 -0.321895 -3.41774 12.1922 6.50792 -5.91061 2.43976
-0.184734 -5.50193 2.33278 -8.37733 1.12845 0.951583 -4.23231 -2.21098 -2.55599 6.56053 -4.02659 -10.9366 2.20055 -8.96011 0.196862 -4.49699 -0.793005
3.12704 2.14365 5.85391 3.2093 1.50542 -5.46851 1.81043 -3.89439 2.17621 1.07769 7.80411 13.9072 8.97308 10.7278 -7.19291 -2.52473 -6.58659
2.44971 0.73773 -1.79319 10.0336 -6.22129 0.31953 0.783538 -6.55799 1.31856 … -3.41797 -1.6813 5.77539 -3.08049 1.66235 1.25254 7.00309 2.6951
-11.154 6.43822 2.14318 5.81669 2.03786 1.15025 -2.11491 4.13416 8.00949 -5.1131 3.3923 0.778615 0.679363 1.86173 1.83339 1.46648 6.09545
5.98668 5.10285 5.45983 -1.18721 -3.70755 7.0925 5.103 -2.27621 3.84499 -0.781866 -3.23418 -2.3769 -2.37448 -3.30754 0.933589 -2.75913 0.364206
-13.4849 -2.86356 3.653 3.56056 0.322314 1.27395 1.4939 -1.7558 9.98444 3.24293 -15.4737 -9.07452 13.5957 -2.15837 3.18083 0.976493 4.45786
-1.0082 -3.38966 -4.02155 -5.5053 -5.37906 -0.376008 -0.725058 -3.56403 -1.8769 1.8387 -2.37666 -5.56774 -1.97787 0.797312 -4.13205 -0.787578 -1.83671
1.75811 8.66197 7.20415 -1.34933 -5.1398 -1.08476 -0.984257 -7.62245 6.51987 … -6.01005 10.8171 1.81885 11.1168 -3.42523 -4.10206 1.99251 -3.41132
0.112358 -6.47499 0.316 0.831317 -8.86483 4.47975 7.09558 -7.69581 5.89953 1.59308 -11.7264 -0.805044 3.32148 -1.31097 3.71042 2.3723 1.2146
-3.47975 -10.0731 2.89155 0.404549 1.46972 -4.84939 7.18921 5.47791 2.59437 -0.9747 -5.06334 3.17199 1.14994 0.0592599 10.4362 4.56883 -3.82712
-6.85618 -7.40577 -6.59401 5.38784 -5.41918 2.10527 2.45604 -0.0598362 0.916278 -0.824134 -4.72543 4.11253 -3.92668 6.14167 4.72191 0.0359086 -0.280309
1.18071 -4.55732 -4.25194 3.94757 6.57966 -3.3513 1.16814 -1.839 -7.05042 7.78813 -3.62933 -5.2294 -0.56637 -5.63994 0.243361 -0.88296 -0.852351
-5.1927 1.47331 1.0722 2.34575 -0.763959 -3.91154 2.47275 -1.25566 5.29109 … -3.73648 -3.51496 -1.11513 0.0217914 2.44424 -4.29189 1.04555 -0.91635
2.23114 -3.24984 3.38598 -6.09565 3.0708 -3.3247 0.311149 4.43763 -2.23465 0.223676 1.95088 4.8373 -3.34843 -3.04138 -1.6914 7.67875 1.10271
-5.53478 -4.13183 -3.21564 -0.276751 -0.600247 -1.34508 4.92285 -0.811928 3.37869 1.01024 -2.3416 -1.61964 2.57656 4.31843 1.71025 4.76718 -1.24788
5.00821 -11.9721 -9.81844 -0.501608 2.79921 -1.13599 7.97383 1.7627 -1.22191 -0.620024 5.66534 2.76723 6.48373 1.60909 -5.19443 -0.565687 6.81647
-3.93591 5.5759 2.85072 0.615663 2.40643 10.4825 4.17097 -0.922532 11.3389 -2.8813 -3.35798 2.70551 1.72967 3.04221 9.47624 -2.58001 10.8072
-4.28853 2.70421 -3.46872 -3.06659 6.95763 -6.55236 0.0519248 6.77244 -1.61432 … -5.24371 10.8945 15.6312 5.20904 11.029 1.48784 1.90943 -4.33497
0.603864 -8.96422 8.12744 -2.28824 -7.73699 1.27218 7.00939 -2.23641 7.92207 -2.33262 -2.84921 -0.131382 2.8406 -4.33235 -3.18134 0.814096 5.58209
-0.471804 -0.644547 -0.0565805 -5.42696 1.46038 -9.56554 -5.56412 -7.40805 -4.75961 6.8637 1.74527 -7.12773 6.29815 0.736255 -15.2387 -2.97371 -10.2457
and not just matrix multiplication:
julia> mC = MMatrix{23,23,Float64}(undef);
julia> @benchmark @. $mC = exp($sA) + $sB
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.600 μs (0.00% GC)
median time: 2.605 μs (0.00% GC)
mean time: 2.611 μs (0.00% GC)
maximum time: 4.084 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark @. $pC = exp($pA) + $pB
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 517.563 ns (0.00% GC)
median time: 531.505 ns (0.00% GC)
mean time: 536.259 ns (0.00% GC)
maximum time: 2.023 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 192
Even without padding, PaddedMatrices uses a lot of performance optimizations. For multiplication with larger matrices, though, it will just fall back to Julia's BLAS/LAPACK library.
how much of this do you think should be on the Bayesian inference side, versus the user-side by their choice of input array types that dispatch to the appropriate methods?
One of ProbabilityModels' performance optimizations is to (pre)allocate a chunk of memory on initialization, and then pass a pointer to this memory to supported functions. Those functions can use the pointer for working memory and/or for allocating the objects they return. In the latter case, they'd increment the pointer.
This currently relies on the PtrArray and DynamicPtrArray types from PaddedMatrices.
If the pointer isn't incremented, that memory gets to pull double-duty, potentially letting the cache be more "hot" than it would be if you defined some structure that holds every individual array/preallocated structure (a named tuple for type stability?).
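To make the idea concrete, here is a minimal, self-contained sketch of the bump-pointer pattern; the names below are illustrative, not the actual ProbabilityModels/PaddedMatrices API:

const STACK = Vector{Float64}(undef, 1 << 20)  # one big buffer, allocated once at load time
const STACK_OFFSET = Ref(0)                    # the "pointer" into it

# "Allocate" n Float64s by handing out a view and bumping the offset.
function bump_alloc(n::Int)
    start = STACK_OFFSET[] + 1
    STACK_OFFSET[] += n
    return view(STACK, start:start+n-1)
end

# Use the stack as scratch space: reset the offset afterwards, so the same
# cache-hot region gets reused by the next caller (the "double-duty" memory).
function with_scratch(f, n::Int)
    mark = STACK_OFFSET[]
    buf = bump_alloc(n)
    result = f(buf)
    STACK_OFFSET[] = mark   # don't keep the increment: memory is only scratch
    return result
end

# A function returning a freshly "allocated" object would instead leave the
# offset incremented, so its result is not overwritten by later calls.

The real implementation hands out raw pointers (PtrArray) rather than views, but the bookkeeping is the same idea.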
I can't argue with flexibility. But if you'd like to implement an optimizing front end, some integration between libraries like this helps IMO.
As for other tricks: the two biggest are that it performs source to source autodiff (avoiding the overhead many AD libraries seem to apply), and defining lots of high level adjoints. I think the latter makes a far bigger difference, and would also be compatible with using different AD libraries, like Zygote. It's probably common for the majority of runtime to be spent inside things like a multivariate normal logpdf. So for those models, defining optimized implementations for Zygote/other AD libraries will get you most of the way to optimal.
I see Turing already has a stdlib/distributions.jl, so I think that'd be a natural extension. Or do you already have something like it somewhere else?
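To illustrate what such a high-level adjoint looks like, here is a hedged sketch using Zygote's @adjoint (normlogpdf_iid is a made-up helper, not an API of Turing or Zygote): a single hand-written rule for an iid-normal log-density replaces tracing through every scalar operation inside it.

using Zygote

# iid Normal(mu, sigma) log-density of a data vector, summed
normlogpdf_iid(mu, sigma, x) =
    -length(x)*log(sigma) - sum(abs2, x .- mu)/(2sigma^2) - length(x)*log(2π)/2

Zygote.@adjoint function normlogpdf_iid(mu, sigma, x)
    r = x .- mu
    lp = -length(x)*log(sigma) - sum(abs2, r)/(2sigma^2) - length(x)*log(2π)/2
    function normlogpdf_pullback(Δ)
        dmu = Δ * sum(r)/sigma^2                               # ∂lp/∂mu
        dsigma = Δ * (sum(abs2, r)/sigma^3 - length(x)/sigma)  # ∂lp/∂sigma
        dx = Δ .* (-r ./ sigma^2)                              # ∂lp/∂x
        return (dmu, dsigma, dx)
    end
    return lp, normlogpdf_pullback
end

# gradient now calls the rule above instead of differentiating the body:
Zygote.gradient(normlogpdf_iid, 0.5, 2.0, randn(10))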
Finally, would you be interested in joining forces?
Sure, I'm interested, and same for working with cscherrer (who I can't seem to tag, I guess because he hasn't participated in this repository). I will be busy for at least the next month, so I'll unfortunately be short on time.
But if we can agree on some strategy, not duplicating our efforts will help us all move forward.
How easily do you think Turing could be adapted to take advantage of the performance tricks I outlined here?
Cool! I remember once seeing benchmarks suggesting Turing was 10x slower than Stan. Is its performance now rather competitive?
We are still working on exact benchmarks but the final benchmarks will be different after https://github.com/TuringLang/Turing.jl/pull/826 and https://github.com/TuringLang/Turing.jl/pull/793 are merged. Hopefully, these will be ready by JuliaCon. But I can guarantee you we are way more than 10x faster than before!
PaddedMatrices are much faster than StaticArrays, whether or not they actually have their namesake padding.
This is a very pleasant surprise to me. I would definitely love to explore the use of other array types in Turing to optimize access patterns. See this issue for more details https://github.com/TuringLang/Turing.jl/issues/825 and this post https://discourse.julialang.org/t/mcmc-landscape/25654/21 for my view of the main components of Turing. The book-keeping struct, or what we call VarInfo in Turing, is where a lot of performance magic can happen for small- and medium-sized models.
One of ProbabilityModel's performance optimizations is to (pre)allocate a chunk of memory on initialization, and then pass a pointer to this memory to supported functions. Those functions can use the pointer for working memory and/or for allocating the objects they return. In the latter case, they'd increment the pointer.
We are still not there yet in Turing. Aggressive pre-allocation hasn't been done yet, mostly because it wasn't the biggest perf improvement we could work on, and because Turing has quite a few moving parts, e.g. AdvancedHMC. I think this also requires a bottom-up approach: models defined by users should make use of pre-allocations and immutables where possible, sampling algorithms should enable full pre-allocation, and then the @model API and parameter book-keeping can also do the same.
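As a tiny generic illustration of the "immutables where possible" point (plain Julia with StaticArrays, not Turing-specific; the step functions are made up):

using StaticArrays, BenchmarkTools

step_immutable(x::SVector{3,Float64}) = x .+ 1.0  # new SVector, stack-allocated
step_mutable(x::Vector{Float64}) = x .+ 1.0       # fresh heap Vector every call

@btime step_immutable($(@SVector ones(3)));  # 0 allocations
@btime step_mutable($(ones(3)));             # 1 allocation per call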
As for other tricks: the two biggest are that it performs source to source autodiff (avoiding the overhead many AD libraries seem to apply), and defining lots of high level adjoints. I see Turing already has a stdlib/distributions.jl, so I think that'd be a natural extension. Or do you already have something like it somewhere else?
Yes, we implement adjoints for various distributions, mostly as workarounds for C libraries called by Distributions.jl. But more aggressive adjoints for high-level functions are also possible to include if it helps performance. We currently support 2 AD backends: ForwardDiff and Tracker. @willtebbutt also recently used Zygote to define adjoints of Tracker.TrackedArray for MvNormal, which was a neat integration of 2 AD backends. For reverse-mode AD, we still use Tracker because it supports mutation through TrackedReal, even if it is not super performant, and it is generally less crash-prone compared to Zygote. That said, a Zygote support PR was also started in https://github.com/TuringLang/Turing.jl/pull/783, and it is very compact since we have a nice abstraction for the various AD backends. More use of Zygote can be made either through Tracker or independently, if we figure out a way to work around mutation.
How easily do you think Turing could be adapted to take advantage of the performance tricks I outlined here?
So there are 2 sides to the performance story in Turing: the model-evaluation/AD side and the parameter book-keeping side. For large models where computing the logpdf is the heaviest part, the book-keeping will be mostly negligible, as AD will take most of the time. However, for small problems, this book-keeping can add significant overhead, and eliminating it can give a significant speedup. One step along this line is discussed in https://github.com/TuringLang/Turing.jl/issues/825. Using PaddedMatrices instead of StaticArrays is an interesting integration of our approaches.
But if we can agree on some strategy, not duplicating our efforts will help us all move forward.
Perhaps we can have a meeting with the other Turing members and Chad to discuss future plans. CC: @yebai, @xukai92, @willtebbutt, @trappmartin, @cpfiffer. Referring to https://discourse.julialang.org/t/mcmc-landscape/25654/21, it seems to me that you are experimenting with a different book-keeping strategy for models which is definitely something you can directly implement in Turing or port it over after it proves successful.
An ideal Turing for me wrt usability and performance has a ~-based API that doesn't restrict user-defined functions and that doesn't change with the choice of the inference algorithm, e.g. particle samplers vs HMC. So the goal is to enable the 3 types of optimizations described above (in user models, in the sampling algorithms, and in the @model API and book-keeping) and not get in the way.
I hope this clarifies things from the Turing side.
CC: @cscherrer
I really like the idea of PPL-interested people joining forces. This way everyone can more easily focus on the parts they are interested in. I think it would be very beneficial if we could work together and have an initial Skype conference.
As a side note: we are extremely fortunate to have Mohamed on the team, as most of the people involved are not necessarily very good at writing efficient Julia code but are more familiar with Bayesian inference and advanced topics in Bayesian statistics. Thus, getting more people who are familiar with efficient implementations in Julia involved with Turing would definitely be a great thing.
@trappmartin thanks for your kind words :)
We are still working on exact benchmarks but the final benchmarks will be different after TuringLang/Turing.jl#826 and TuringLang/Turing.jl#793 are merged. Hopefully, these will be ready by JuliaCon. But I can guarantee you we are way more than 10x faster than before!
That is great to hear.
This is a very pleasant surprise to me. I would definitely love to explore the use of other array types in Turing to optimize access pattern.
We can use StaticArrays.MArray for the fields of TypedVarInfo since it won't need to grow in size.
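(For reference, a minimal example of the fixed-size-but-mutable array type in question:)

using StaticArrays

# MArray/MVector are fixed-size but mutable: in-place writes work,
# but the length is part of the type, so there is no push!/resize!.
vals = MVector{4,Float64}(undef)
vals[1] = 0.5       # fine
# push!(vals, 1.0)  # errors: the size cannot grow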
I've done a whole lot of performance work with PaddedMatrices.jl. Even a very basic example, like a simple @inbounds @fastmath for loop (to minimize mystery), shows the difference:
julia> using PaddedMatrices, StaticArrays, BenchmarkTools
julia> pa = @btime @Mutable randn(32,32);
1.588 μs (1 allocation: 8.06 KiB)
julia> ma = @btime @MMatrix randn(32,32);
4.729 μs (1 allocation: 8.06 KiB)
They are both mutable structs wrapping an NTuple. They are fully compatible in terms of memory layout, so we can even just
julia> pa.data = ma.data;
julia> [pa[i,j] == ma[i,j] for i in 1:32, j in 1:32 ] # broadcasting isn't reliable at the moment...
32×32 Array{Bool,2}:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Yet there is a massive difference in performance, even for such a simple function:
julia> function my_sum(x)
           s = zero(eltype(x))
           @inbounds @fastmath for i in 1:length(x)
               s += x[i]
           end
           s
       end
my_sum (generic function with 1 method)
julia> @btime my_sum($ma)
876.833 ns (0 allocations: 0 bytes)
21.33859997410549
julia> @btime my_sum($pa)
31.660 ns (0 allocations: 0 bytes)
21.33859997410542
As implied by the times, we get very different assembly (MMatrices are not vectorized by the compiler, but PaddedMatrices are):
julia> @code_native my_sum(ma)
.text
; ┌ @ REPL[31]:2 within `my_sum'
vxorpd %xmm0, %xmm0, %xmm0
xorl %eax, %eax
nopw %cs:(%rax,%rax)
; │ @ REPL[31]:4 within `my_sum'
; │┌ @ fastmath.jl:161 within `add_fast'
L16:
vaddsd (%rdi,%rax,8), %xmm0, %xmm0
; │└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
addq $1, %rax
cmpq $1024, %rax # imm = 0x400
; │└└
jne L16
; │ @ REPL[31]:6 within `my_sum'
retq
nopw %cs:(%rax,%rax)
nopl (%rax)
; └
julia> @code_native my_sum(pa)
.text
; ┌ @ REPL[31]:2 within `my_sum'
vxorpd %xmm0, %xmm0, %xmm0
xorl %eax, %eax
vxorpd %xmm1, %xmm1, %xmm1
vxorpd %xmm2, %xmm2, %xmm2
vxorpd %xmm3, %xmm3, %xmm3
nopw %cs:(%rax,%rax)
nopl (%rax)
; │ @ REPL[31]:4 within `my_sum'
; │┌ @ fastmath.jl:161 within `add_fast'
L32:
vaddpd (%rdi,%rax,8), %zmm0, %zmm0
vaddpd 64(%rdi,%rax,8), %zmm1, %zmm1
vaddpd 128(%rdi,%rax,8), %zmm2, %zmm2
vaddpd 192(%rdi,%rax,8), %zmm3, %zmm3
addq $32, %rax
cmpq $1024, %rax # imm = 0x400
jne L32
vaddpd %zmm0, %zmm1, %zmm0
vaddpd %zmm0, %zmm2, %zmm0
vaddpd %zmm0, %zmm3, %zmm0
vextractf64x4 $1, %zmm0, %ymm1
vaddpd %zmm1, %zmm0, %zmm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %zmm1, %zmm0, %zmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddpd %xmm1, %xmm0, %xmm0
; │└
; │ @ REPL[31]:6 within `my_sum'
vzeroupper
retq
nopw %cs:(%rax,%rax)
nop
; └
Summing the MMatrix is not vectorized. The compiler simply loops over vaddsd (vector add, single double), instead of looping over 4 vaddpd instructions (vector add, packed doubles); the 4 separate accumulators take advantage of out-of-order execution, breaking up the dependency chain to hide instruction latency.
(The for loops are the parts that say:
L[some number]:
    a bunch of code
    cmpq               # compare two integers
    jne L[that number] # jump back to location [that number] if the comparison was not equal
)
Note that the only functions getting called on the arrays are length and getindex! MMatrices simply don't get along with Julia's autovectorizer at all.
You'll get basically the same performance and assembly with a regular matrix and for i in 1:1024:
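(my_sum1024 isn't shown in the original session; its definition, implied by the comment below, would be:)

julia> function my_sum1024(x)
           s = zero(eltype(x))
           @inbounds @fastmath for i in 1:1024
               s += x[i]
           end
           s
       end
my_sum1024 (generic function with 1 method)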
julia> aa = Array(pa);
julia> @btime my_sum1024($aa) # replaced `for i in 1:length(x)` with `for i in 1:1024`
31.105 ns (0 allocations: 0 bytes)
21.33859997410542
julia> @code_native my_sum1024(aa)
.text
; ┌ @ REPL[37]:2 within `my_sum1024'
movq (%rdi), %rax
vxorpd %xmm0, %xmm0, %xmm0
xorl %ecx, %ecx
vxorpd %xmm1, %xmm1, %xmm1
vxorpd %xmm2, %xmm2, %xmm2
vxorpd %xmm3, %xmm3, %xmm3
nopw %cs:(%rax,%rax)
nop
; │ @ REPL[37]:4 within `my_sum1024'
; │┌ @ fastmath.jl:161 within `add_fast'
L32:
vaddpd (%rax,%rcx,8), %zmm0, %zmm0
vaddpd 64(%rax,%rcx,8), %zmm1, %zmm1
vaddpd 128(%rax,%rcx,8), %zmm2, %zmm2
vaddpd 192(%rax,%rcx,8), %zmm3, %zmm3
addq $32, %rcx
cmpq $1024, %rcx # imm = 0x400
jne L32
vaddpd %zmm0, %zmm1, %zmm0
vaddpd %zmm0, %zmm2, %zmm0
vaddpd %zmm0, %zmm3, %zmm0
vextractf64x4 $1, %zmm0, %ymm1
vaddpd %zmm1, %zmm0, %zmm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %zmm1, %zmm0, %zmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddpd %xmm1, %xmm0, %xmm0
; │└
; │ @ REPL[37]:6 within `my_sum1024'
vzeroupper
retq
nopw %cs:(%rax,%rax)
nop
; └
So this at least is nothing special about PaddedMatrices; it (like most Julia objects) simply is not getting in the way of Julia's autovectorizer, while StaticArrays.MArrays do.
With the 2 optimizations in place and a whole bunch of inlining, it may be possible for the Julia compiler to avoid allocating any memory on the heap for TypedVarInfo! The Julia compiler is good at eliding allocations when mutable objects only live within a single function body. This means that if we make sure all the functions which use vi::TypedVarInfo are properly inlined, the compiler may be able to elide these allocations entirely. Unfortunately, this is not true for Base.Array, since these are implemented in C and so have special semantics.
I've heard that a bunch of inlining isn't necessarily optimal for performance, and I intend to move away from that route once this issue is resolved.
To be able to avoid inlining everything, I'm relying on having ProbabilityModels.jl manually manage all the memory needed with its own stack. The "preallocation" comes with a malloc call when the library is loaded. Currently, the malloc is 1 GiB, but I'm thinking of increasing it. It seems so far that the OS will only eat up as much memory as you've actually touched, so picking big numbers seems safe / does not actually eat system resources.
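In code, the load-time preallocation amounts to something like this sketch (Libc.malloc is Julia's standard wrapper; in a package the call would live in __init__ so it runs when the library is loaded):

const STACK_BASE = convert(Ptr{Float64}, Libc.malloc(1 << 30))  # reserve 1 GiB
# Pages are committed lazily by the OS, so the untouched remainder
# of the gigabyte costs no physical memory.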
Regarding VarInfo itself, is there some succinct summary of what it does / what bookkeeping you have to do?
I think my equivalent is that defining a model named Foo defines a struct with name Foo, and a field for every variable in the model that doesn't look like a function and isn't on the left-hand side of an equals sign.
Those variables are considered "unknowns": they must either be data/priors or parameters.
The struct is parameterized by the type of every field.
One can then create an instance of the struct, using keyword arguments to define each of the unknowns as either particular objects (eg, y = 3), or as a type (σ = PositiveFloat) of known size (ie, PositiveFloat is of length 1, RealVector{23} is of length 23, LKJCorrCholesky{9} is of length (9*8)>>1).
Fields given non-type arguments are now considered known data/priors.
Fields given types are assumed to be unknown parameters, with type equal to that unknown type.
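A hedged sketch of that pattern in plain Julia (Foo, PositiveFloat, and isparam are illustrative stand-ins, not the actual ProbabilityModels API):

struct PositiveFloat end   # stand-in for a length-1 constrained parameter type

struct Foo{Ty,Tσ}          # one parameterized field per unknown
    y::Ty
    σ::Tσ
end
# Keyword constructor; storing Type{T} keeps a type-valued argument's identity
# in Foo's type parameters, so dispatch can distinguish data from parameters.
Foo(; y, σ) = Foo{y isa Type ? Type{y} : typeof(y),
                  σ isa Type ? Type{σ} : typeof(σ)}(y, σ)

isparam(::Type) = true   # a field given a type is an unknown parameter
isparam(::Any)  = false  # a field given a value is known data/priors

m = Foo(y = 3.0, σ = PositiveFloat)
isparam(m.y), isparam(m.σ)   # (false, true)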
Given a vector with length equal to the total length of all parameters, the generated logdensity/logdensity + gradient functions simply load all the parameters from the vector (adding required Jacobians and their gradients), assigning them to variables with the parameter's name. That is, it would add something equivalent to
mangled_name = @inbounds mangled_param_vector_name[mangled_some_index]
σ = Base.exp(mangled_name)
target += mangled_name # log jacobian
(mangling is just referring to ensuring hygiene; target is the same as in Stan)
Julia's compiler obviously has no problem figuring out types from code like this, or optimizing it.
σ is now defined inside the function body, so any references to σ in the model find it in the local scope and use it. It is a boring old variable, like any other, without the need for bookkeeping.
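Put together, the generated log-density for a toy model with a positive scalar σ and a single observation y ~ Normal(0, σ) might look roughly like this (a hand-written sketch with the mangling dropped, not actual generated output):

function logdensity_sketch(θ::AbstractVector{Float64}, y::Float64)
    target = 0.0
    raw_σ = @inbounds θ[1]          # load the parameter from the vector
    σ = exp(raw_σ)                  # transform ℝ → ℝ₊
    target += raw_σ                 # log Jacobian of the exp transform
    target += -log(σ) - (y/σ)^2/2   # Normal(0, σ) log-density, up to a constant
    return target
end

logdensity_sketch([0.3], 1.2)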
The order someone accesses the loaded parameters down the line doesn't matter (for validity), but because it does matter for performance (ie, keep cache hot, or even let things be stored in registers instead of the stack when possible), an eventual optimization might be to better organize the generated function, placing loads closer to first references.
That the mental model of what is going on in the generated functions is so familiar -- ie, exactly what goes on in code we write directly -- is why I keep gravitating to them. No worries about "will the compiler be able to elide this, can it be coerced into optimizing that", etc. Just make the generated function generate the code you would want to write.
It would be an overstatement to say ProbabilityModels only supports static models at the moment, in that saying this would imply it works reliably enough to say it supports anything.
Once the library is more mature, it will remain static-only for quite some time; if/else statements and loops will take a while. I'm going to have to figure out what to do with code like this (I Meta.lower users' models as one of the first processing steps):
julia> Meta.@lower for i ∈ 1:5
target += p[i]
end
:($(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ %1 = 1:5
│ #s9 = Base.iterate(%1)
│ %3 = #s9 === nothing
│ %4 = Base.not_int(%3)
└── goto #4 if not %4
2 ┄ %6 = #s9
│ i = Core.getfield(%6, 1)
│ %8 = Core.getfield(%6, 2)
│ └
│ @ REPL[20]:2 within `top-level scope`
│ %9 = target
│ %10 = Base.getindex(p, i)
│ target = %9 + %10
│ #s9 = Base.iterate(%1, %8)
│ %13 = #s9 === nothing
│ %14 = Base.not_int(%13)
└── goto #4 if not %14
3 ─ goto #2
4 ┄ return
))))
But I plan on letting someone hide dynamic pieces in another function, and just ForwardDiff (or Zygote, etc.) it automatically if they don't provide a gradient.
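For instance, the fallback could look something like this sketch (dynamic_part and dynamic_part_grad are made-up names; assumes ForwardDiff is installed):

using ForwardDiff

# A dynamic, loop-heavy piece that the source-to-source pass can't handle:
dynamic_part(p) = sum(p[i]^2 / i for i in 1:length(p))

# If the user provides no hand-written gradient, fall back to AD:
dynamic_part_grad(p) = ForwardDiff.gradient(dynamic_part, p)

dynamic_part_grad([1.0, 2.0, 3.0])   # [2.0, 2.0, 2.0]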
Regarding PPL people joining forces, how about also looping in the Gen.jl team? @marcoct
They are doing some cool stuff, but taking a less conventional approach, so there may or may not be enough intersection here; if there is, though, it would be very powerful (and simplifying for end users).
Who in the Julia∩PPL community will be at JuliaCon? I'll be there for the week; should be a great time to sync.
I think @cpfiffer and @willtebbutt are attending. Maybe @xukai92 too.
There's also going to be a Gen.jl talk.
@chriselrod thanks for the elaborate response. Sorry it took me a while to get back to you. I am officially on a holiday!
I am definitely sold on PaddedMatrices.jl. I will try to incorporate it in Turing's VarInfo somehow and see how much speedup I can get.
Regarding VarInfo itself is there some succinct summary about what it does / what bookkeeping you have to do?
The main complexity of VarInfo comes from it supporting many types of models and inference algorithms. For example, Turing supports dynamic models where the number of parameters can itself be random. Stochastic flow control is also supported, so variables can be sampled in different orders. This requires, in the most general case, some sort of identifier for each variable and a dictionary that maps it to its values. The way we do this in Turing is using a NamedTuple mapping from parameter symbols to their metadata. Each symbol's value is a Julia scalar, array or array of arrays, and its size can differ between iterations. A single symbol can also cover multiple parameters, e.g. x[i] ~ Normal(i, 1.0) in a loop. The most basic form of metadata is the value of the parameters. Additionally, some inference algorithms, e.g. particle samplers, require extra information, such as the relative order of parameter sampling wrt the observations. For this, we keep additional information in the metadata of each symbol.
We also store the distribution of each random variable to enable the transformation of parameters from the constrained parameter space to Euclidean space and back for Hamiltonian sampling. The distribution also helps us reconstruct the parameter value in its true form from the linearized values, since we store the values in a single vector per symbol. Because of its breadth, VarInfo is a quite complicated data structure. However, one future plan is to make a specialized VarInfo for static models and enable different types of optimizations under more restricted assumptions.
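A much-simplified sketch of the shape of this bookkeeping (illustrative field names only; the real Turing Metadata has considerably more machinery):

struct MetadataSketch{Tvals,Tdists}
    vals::Tvals          # linearized values for this symbol, one flat vector
    dists::Tdists        # distribution of each variable, used to transform
                         # between constrained and unconstrained space
    orders::Vector{Int}  # sampling order relative to the observations,
                         # needed by e.g. particle samplers
end

# A NamedTuple from parameter symbols to their metadata, as described above
# (placeholder symbols stand in for actual Distributions.jl objects):
vi = (x = MetadataSketch(randn(3), fill(:Normal, 3), [1, 2, 3]),
      σ = MetadataSketch([0.5], [:InverseGamma], [4]))

vi.x.vals   # the flat value vector for symbol x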
Your pre-allocation approach is interesting and somewhat non-traditional as far as Julia packages are concerned IIUC. Ideally, I would like to have some form of pre-allocation but without having to manage the pointers explicitly as that may increase the complexity of the code. I would love to read ProbabilityModels.jl to understand your approach better. It may not be that hard after all!
Thanks for your great package once again, and good luck!