mohamed82008 opened this issue 5 years ago:
Hi Chris!
Thanks for this cool package. I wonder what the intended scope of this package is, and how it compares to something like Turing.jl. Also if you have some performance improvement ideas for Turing.jl, I would be interested to hear them. Note that there is a lot of interest now in the Turing team to improve its performance beyond simple type stability and Julia gotchas and your work here seems quite relevant. Turing is now very close to being fully type stable and Chris Rackauckas already reports a 15x speedup in his DiffEqBayes which uses Turing.
Eliminating all unnecessary allocations and enabling the full use of static arrays is also something on my radar, but I am curious if there are other tricks that you suggest. I see you rely a lot on PaddedMatrices.jl but how much of this do you think should be on the Bayesian inference side, versus the user-side by their choice of input array types that dispatch to the appropriate methods? Finally, would you be interested in joining forces?
Hi Mohamed!
I just added a README, where I briefly go into how the library works.
Turing is now very close to being fully type stable and Chris Rackauckas already reports a 15x speedup in his DiffEqBayes which uses Turing.
Cool! I remember once seeing benchmarks suggesting Turing was 10x slower than Stan. Is its performance now rather competitive?
Eliminating all unnecessary allocations and enabling the full use of static arrays is also something on my radar, but I am curious if there are other tricks that you suggest. I see you rely a lot on PaddedMatrices.jl
PaddedMatrices are much faster than StaticArrays, whether or not they actually have their namesake padding. Explicitly avoiding padding (the type parameters are nrows, ncols, element type, stride; if stride == nrows, then there is no padding):
julia> using PaddedMatrices, StaticArrays, BenchmarkTools, LinearAlgebra, Random
julia> pA = MutableFixedSizePaddedMatrix{7,7,Float64,7}(undef); randn!(pA); #to avoid padding
julia> pB = MutableFixedSizePaddedMatrix{7,7,Float64,7}(undef); randn!(pB); #to avoid padding
julia> pC = MutableFixedSizePaddedMatrix{7,7,Float64,7}(undef);
julia> sA = SMatrix{7,7}(pA);
julia> sB = SMatrix{7,7}(pB);
julia> @btime $sA * $sB
67.005 ns (0 allocations: 0 bytes)
7×7 SArray{Tuple{7,7},Float64,2,49} with indices SOneTo(7)×SOneTo(7):
2.96419 4.83028 -2.11728 1.00643 -5.69249 4.66337 2.24875
-0.446299 -1.35403 0.00484316 -0.323731 -0.155002 -1.72822 2.07241
1.43737 -0.531415 -3.50058 -0.626386 -3.66252 3.2198 0.504351
1.35284 -1.06179 -2.70543 -2.05564 -2.45683 2.49386 0.958187
1.14669 -1.48968 -3.22534 -1.49174 -2.56398 2.87688 -0.840148
1.33924 2.57827 1.07435 7.49071 6.52194 1.27062 -0.253103
-2.73455 -0.357343 3.97818 -4.22651 -0.354159 -4.7175 -1.29675
julia> @btime mul!($pC, $pA, $pB); pC
17.997 ns (0 allocations: 0 bytes)
7×7 MutableFixedSizePaddedArray{Tuple{7,7},Float64,2,7,49}:
2.96419 4.83028 -2.11728 1.00643 -5.69249 4.66337 2.24875
-0.446299 -1.35403 0.00484316 -0.323731 -0.155002 -1.72822 2.07241
1.43737 -0.531415 -3.50058 -0.626386 -3.66252 3.2198 0.504351
1.35284 -1.06179 -2.70543 -2.05564 -2.45683 2.49386 0.958187
1.14669 -1.48968 -3.22534 -1.49174 -2.56398 2.87688 -0.840148
1.33924 2.57827 1.07435 7.49071 6.52194 1.27062 -0.253103
-2.73455 -0.357343 3.97818 -4.22651 -0.354159 -4.7175 -1.29675
julia> pA = MutableFixedSizePaddedMatrix{23,23,Float64,23}(undef); randn!(pA); #to avoid padding
julia> pB = MutableFixedSizePaddedMatrix{23,23,Float64,23}(undef); randn!(pB); #to avoid padding
julia> pC = MutableFixedSizePaddedMatrix{23,23,Float64,23}(undef);
julia> sA = SMatrix{23,23}(pA);
julia> sB = SMatrix{23,23}(pB);
julia> @btime $sA * $sB
1.404 μs (0 allocations: 0 bytes)
23×23 SArray{Tuple{23,23},Float64,2,529} with indices SOneTo(23)×SOneTo(23):
0.676137 -11.5472 -4.29015 -5.12626 -6.57682 7.23703 9.99614 1.76468 1.25541 … -2.59669 -5.0571 -0.214485 -0.683722 -1.23128 1.8082 -1.67359 7.54335
-4.13968 -5.58828 3.71343 -13.4674 4.54442 -1.58558 -3.78945 -0.200756 3.10997 5.18093 -5.25095 -5.6478 3.95834 -6.12274 -0.548574 -3.54803 -0.856037
-11.2188 -12.0339 -6.31261 5.65743 1.7753 -6.29977 -0.461274 10.0776 -5.18723 3.58942 -6.23296 -0.321895 -3.41774 12.1922 6.50792 -5.91061 2.43976
-0.184734 -5.50193 2.33278 -8.37733 1.12845 0.951583 -4.23231 -2.21098 -2.55599 6.56053 -4.02659 -10.9366 2.20055 -8.96011 0.196862 -4.49699 -0.793005
3.12704 2.14365 5.85391 3.2093 1.50542 -5.46851 1.81043 -3.89439 2.17621 1.07769 7.80411 13.9072 8.97308 10.7278 -7.19291 -2.52473 -6.58659
2.44971 0.73773 -1.79319 10.0336 -6.22129 0.31953 0.783538 -6.55799 1.31856 … -3.41797 -1.6813 5.77539 -3.08049 1.66235 1.25254 7.00309 2.6951
-11.154 6.43822 2.14318 5.81669 2.03786 1.15025 -2.11491 4.13416 8.00949 -5.1131 3.3923 0.778615 0.679363 1.86173 1.83339 1.46648 6.09545
5.98668 5.10285 5.45983 -1.18721 -3.70755 7.0925 5.103 -2.27621 3.84499 -0.781866 -3.23418 -2.3769 -2.37448 -3.30754 0.933589 -2.75913 0.364206
-13.4849 -2.86356 3.653 3.56056 0.322314 1.27395 1.4939 -1.7558 9.98444 3.24293 -15.4737 -9.07452 13.5957 -2.15837 3.18083 0.976493 4.45786
-1.0082 -3.38966 -4.02155 -5.5053 -5.37906 -0.376008 -0.725058 -3.56403 -1.8769 1.8387 -2.37666 -5.56774 -1.97787 0.797312 -4.13205 -0.787578 -1.83671
1.75811 8.66197 7.20415 -1.34933 -5.1398 -1.08476 -0.984257 -7.62245 6.51987 … -6.01005 10.8171 1.81885 11.1168 -3.42523 -4.10206 1.99251 -3.41132
0.112358 -6.47499 0.316 0.831317 -8.86483 4.47975 7.09558 -7.69581 5.89953 1.59308 -11.7264 -0.805044 3.32148 -1.31097 3.71042 2.3723 1.2146
-3.47975 -10.0731 2.89155 0.404549 1.46972 -4.84939 7.18921 5.47791 2.59437 -0.9747 -5.06334 3.17199 1.14994 0.0592599 10.4362 4.56883 -3.82712
-6.85618 -7.40577 -6.59401 5.38784 -5.41918 2.10527 2.45604 -0.0598362 0.916278 -0.824134 -4.72543 4.11253 -3.92668 6.14167 4.72191 0.0359086 -0.280309
1.18071 -4.55732 -4.25194 3.94757 6.57966 -3.3513 1.16814 -1.839 -7.05042 7.78813 -3.62933 -5.2294 -0.56637 -5.63994 0.243361 -0.88296 -0.852351
-5.1927 1.47331 1.0722 2.34575 -0.763959 -3.91154 2.47275 -1.25566 5.29109 … -3.73648 -3.51496 -1.11513 0.0217914 2.44424 -4.29189 1.04555 -0.91635
2.23114 -3.24984 3.38598 -6.09565 3.0708 -3.3247 0.311149 4.43763 -2.23465 0.223676 1.95088 4.8373 -3.34843 -3.04138 -1.6914 7.67875 1.10271
-5.53478 -4.13183 -3.21564 -0.276751 -0.600247 -1.34508 4.92285 -0.811928 3.37869 1.01024 -2.3416 -1.61964 2.57656 4.31843 1.71025 4.76718 -1.24788
5.00821 -11.9721 -9.81844 -0.501608 2.79921 -1.13599 7.97383 1.7627 -1.22191 -0.620024 5.66534 2.76723 6.48373 1.60909 -5.19443 -0.565687 6.81647
-3.93591 5.5759 2.85072 0.615663 2.40643 10.4825 4.17097 -0.922532 11.3389 -2.8813 -3.35798 2.70551 1.72967 3.04221 9.47624 -2.58001 10.8072
-4.28853 2.70421 -3.46872 -3.06659 6.95763 -6.55236 0.0519248 6.77244 -1.61432 … -5.24371 10.8945 15.6312 5.20904 11.029 1.48784 1.90943 -4.33497
0.603864 -8.96422 8.12744 -2.28824 -7.73699 1.27218 7.00939 -2.23641 7.92207 -2.33262 -2.84921 -0.131382 2.8406 -4.33235 -3.18134 0.814096 5.58209
-0.471804 -0.644547 -0.0565805 -5.42696 1.46038 -9.56554 -5.56412 -7.40805 -4.75961 6.8637 1.74527 -7.12773 6.29815 0.736255 -15.2387 -2.97371 -10.2457
julia> @btime mul!($pC, $pA, $pB); pC
273.590 ns (0 allocations: 0 bytes)
23×23 MutableFixedSizePaddedArray{Tuple{23,23},Float64,2,23,529}:
0.676137 -11.5472 -4.29015 -5.12626 -6.57682 7.23703 9.99614 1.76468 1.25541 … -2.59669 -5.0571 -0.214485 -0.683722 -1.23128 1.8082 -1.67359 7.54335
-4.13968 -5.58828 3.71343 -13.4674 4.54442 -1.58558 -3.78945 -0.200756 3.10997 5.18093 -5.25095 -5.6478 3.95834 -6.12274 -0.548574 -3.54803 -0.856037
-11.2188 -12.0339 -6.31261 5.65743 1.7753 -6.29977 -0.461274 10.0776 -5.18723 3.58942 -6.23296 -0.321895 -3.41774 12.1922 6.50792 -5.91061 2.43976
-0.184734 -5.50193 2.33278 -8.37733 1.12845 0.951583 -4.23231 -2.21098 -2.55599 6.56053 -4.02659 -10.9366 2.20055 -8.96011 0.196862 -4.49699 -0.793005
3.12704 2.14365 5.85391 3.2093 1.50542 -5.46851 1.81043 -3.89439 2.17621 1.07769 7.80411 13.9072 8.97308 10.7278 -7.19291 -2.52473 -6.58659
2.44971 0.73773 -1.79319 10.0336 -6.22129 0.31953 0.783538 -6.55799 1.31856 … -3.41797 -1.6813 5.77539 -3.08049 1.66235 1.25254 7.00309 2.6951
-11.154 6.43822 2.14318 5.81669 2.03786 1.15025 -2.11491 4.13416 8.00949 -5.1131 3.3923 0.778615 0.679363 1.86173 1.83339 1.46648 6.09545
5.98668 5.10285 5.45983 -1.18721 -3.70755 7.0925 5.103 -2.27621 3.84499 -0.781866 -3.23418 -2.3769 -2.37448 -3.30754 0.933589 -2.75913 0.364206
-13.4849 -2.86356 3.653 3.56056 0.322314 1.27395 1.4939 -1.7558 9.98444 3.24293 -15.4737 -9.07452 13.5957 -2.15837 3.18083 0.976493 4.45786
-1.0082 -3.38966 -4.02155 -5.5053 -5.37906 -0.376008 -0.725058 -3.56403 -1.8769 1.8387 -2.37666 -5.56774 -1.97787 0.797312 -4.13205 -0.787578 -1.83671
1.75811 8.66197 7.20415 -1.34933 -5.1398 -1.08476 -0.984257 -7.62245 6.51987 … -6.01005 10.8171 1.81885 11.1168 -3.42523 -4.10206 1.99251 -3.41132
0.112358 -6.47499 0.316 0.831317 -8.86483 4.47975 7.09558 -7.69581 5.89953 1.59308 -11.7264 -0.805044 3.32148 -1.31097 3.71042 2.3723 1.2146
-3.47975 -10.0731 2.89155 0.404549 1.46972 -4.84939 7.18921 5.47791 2.59437 -0.9747 -5.06334 3.17199 1.14994 0.0592599 10.4362 4.56883 -3.82712
-6.85618 -7.40577 -6.59401 5.38784 -5.41918 2.10527 2.45604 -0.0598362 0.916278 -0.824134 -4.72543 4.11253 -3.92668 6.14167 4.72191 0.0359086 -0.280309
1.18071 -4.55732 -4.25194 3.94757 6.57966 -3.3513 1.16814 -1.839 -7.05042 7.78813 -3.62933 -5.2294 -0.56637 -5.63994 0.243361 -0.88296 -0.852351
-5.1927 1.47331 1.0722 2.34575 -0.763959 -3.91154 2.47275 -1.25566 5.29109 … -3.73648 -3.51496 -1.11513 0.0217914 2.44424 -4.29189 1.04555 -0.91635
2.23114 -3.24984 3.38598 -6.09565 3.0708 -3.3247 0.311149 4.43763 -2.23465 0.223676 1.95088 4.8373 -3.34843 -3.04138 -1.6914 7.67875 1.10271
-5.53478 -4.13183 -3.21564 -0.276751 -0.600247 -1.34508 4.92285 -0.811928 3.37869 1.01024 -2.3416 -1.61964 2.57656 4.31843 1.71025 4.76718 -1.24788
5.00821 -11.9721 -9.81844 -0.501608 2.79921 -1.13599 7.97383 1.7627 -1.22191 -0.620024 5.66534 2.76723 6.48373 1.60909 -5.19443 -0.565687 6.81647
-3.93591 5.5759 2.85072 0.615663 2.40643 10.4825 4.17097 -0.922532 11.3389 -2.8813 -3.35798 2.70551 1.72967 3.04221 9.47624 -2.58001 10.8072
-4.28853 2.70421 -3.46872 -3.06659 6.95763 -6.55236 0.0519248 6.77244 -1.61432 … -5.24371 10.8945 15.6312 5.20904 11.029 1.48784 1.90943 -4.33497
0.603864 -8.96422 8.12744 -2.28824 -7.73699 1.27218 7.00939 -2.23641 7.92207 -2.33262 -2.84921 -0.131382 2.8406 -4.33235 -3.18134 0.814096 5.58209
-0.471804 -0.644547 -0.0565805 -5.42696 1.46038 -9.56554 -5.56412 -7.40805 -4.75961 6.8637 1.74527 -7.12773 6.29815 0.736255 -15.2387 -2.97371 -10.2457
and not just matrix multiplication:
julia> mC = MMatrix{23,23,Float64}(undef);
julia> @benchmark @. $mC = exp($sA) + $sB
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.600 μs (0.00% GC)
median time: 2.605 μs (0.00% GC)
mean time: 2.611 μs (0.00% GC)
maximum time: 4.084 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark @. $pC = exp($pA) + $pB
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 517.563 ns (0.00% GC)
median time: 531.505 ns (0.00% GC)
mean time: 536.259 ns (0.00% GC)
maximum time: 2.023 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 192
Even without padding, PaddedMatrices uses a lot of performance optimizations. For multiplication with larger matrices, though, it will just fall back to Julia's BLAS/LAPACK library.
how much of this do you think should be on the Bayesian inference side, versus the user-side by their choice of input array types that dispatch to the appropriate methods?
One of ProbabilityModels' performance optimizations is to (pre)allocate a chunk of memory on initialization, and then pass a pointer to this memory to supported functions. Those functions can use the pointer for working memory and/or for allocating the objects they return. In the latter case, they'd increment the pointer.
This currently relies on the PtrArray and DynamicPtrArray types from PaddedMatrices.
If the pointer isn't incremented, that memory gets to pull double-duty, potentially letting the cache be more "hot" than it would be if you defined some structure that holds every individual array/preallocated structure (a named tuple for type stability?).
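To make the idea concrete, here is a minimal, self-contained sketch of the bump-pointer pattern; the names below are illustrative, not the actual ProbabilityModels/PaddedMatrices API:

const STACK = Vector{Float64}(undef, 1 << 20)  # one big buffer, allocated once at load time
const STACK_OFFSET = Ref(0)                    # the "pointer" into it

# "Allocate" n Float64s by handing out a view and bumping the offset.
function bump_alloc(n::Int)
    start = STACK_OFFSET[] + 1
    STACK_OFFSET[] += n
    return view(STACK, start:start+n-1)
end

# Use the stack as scratch space: reset the offset afterwards, so the same
# cache-hot region gets reused by the next caller (the "double-duty" memory).
function with_scratch(f, n::Int)
    mark = STACK_OFFSET[]
    buf = bump_alloc(n)
    result = f(buf)
    STACK_OFFSET[] = mark   # don't keep the increment: memory is only scratch
    return result
end

# A function returning a freshly "allocated" object would instead leave the
# offset incremented, so its result is not overwritten by later calls.

The real implementation hands out raw pointers (PtrArray) rather than views, but the bookkeeping is the same idea.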
I can't argue with flexibility. But if you'd like to implement an optimizing front end, some integration between libraries like this helps IMO.
As for other tricks: the two biggest are that it performs source to source autodiff (avoiding the overhead many AD libraries seem to apply), and defining lots of high level adjoints. I think the latter makes a far bigger difference, and would also be compatible with using different AD libraries, like Zygote. It's probably common for the majority of runtime to be spent inside things like a multivariate normal logpdf. So for those models, defining optimized implementations for Zygote/other AD libraries will get you most of the way to optimal.
I see Turing already has a stdlib/distributions.jl, so I think that'd be a natural extension. Or do you already have something like it somewhere else?
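To illustrate what such a high-level adjoint looks like, here is a hedged sketch using Zygote's @adjoint (normlogpdf_iid is a made-up helper, not an API of Turing or Zygote): a single hand-written rule for an iid-normal log-density replaces tracing through every scalar operation inside it.

using Zygote

# iid Normal(mu, sigma) log-density of a data vector, summed
normlogpdf_iid(mu, sigma, x) =
    -length(x)*log(sigma) - sum(abs2, x .- mu)/(2sigma^2) - length(x)*log(2π)/2

Zygote.@adjoint function normlogpdf_iid(mu, sigma, x)
    r = x .- mu
    lp = -length(x)*log(sigma) - sum(abs2, r)/(2sigma^2) - length(x)*log(2π)/2
    function normlogpdf_pullback(Δ)
        dmu = Δ * sum(r)/sigma^2                               # ∂lp/∂mu
        dsigma = Δ * (sum(abs2, r)/sigma^3 - length(x)/sigma)  # ∂lp/∂sigma
        dx = Δ .* (-r ./ sigma^2)                              # ∂lp/∂x
        return (dmu, dsigma, dx)
    end
    return lp, normlogpdf_pullback
end

# gradient now calls the rule above instead of differentiating the body:
Zygote.gradient(normlogpdf_iid, 0.5, 2.0, randn(10))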
Finally, would you be interested in joining forces?
Sure, I'm interested, and same for working with cscherrer (who I can't seem to tag, I guess because he hasn't participated in this repository). I will be busy for at least the next month, so I'll unfortunately be short on time.
But if we can agree on some strategy, not duplicating our efforts will help us all move forward.
How easily do you think Turing could be adapted to take advantage of the performance tricks I outlined here?
Cool! I remember once seeing benchmarks suggesting Turing was 10x slower than Stan. Is its performance now rather competitive?
We are still working on exact benchmarks but the final benchmarks will be different after https://github.com/TuringLang/Turing.jl/pull/826 and https://github.com/TuringLang/Turing.jl/pull/793 are merged. Hopefully, these will be ready by JuliaCon. But I can guarantee you we are way more than 10x faster than before!
PaddedMatrices are much faster than StaticArrays, whether or not they actually have their namesake padding.
This is a very pleasant surprise to me. I would definitely love to explore the use of other array types in Turing to optimize access patterns. See this issue for more details https://github.com/TuringLang/Turing.jl/issues/825 and this post https://discourse.julialang.org/t/mcmc-landscape/25654/21 for my view of the main components of Turing. The book-keeping struct, or what we call VarInfo in Turing, is where a lot of performance magic can happen for small- and medium-sized models.
One of ProbabilityModel's performance optimizations is to (pre)allocate a chunk of memory on initialization, and then pass a pointer to this memory to supported functions. Those functions can use the pointer for working memory and/or for allocating the objects they return. In the latter case, they'd increment the pointer.
We are still not there yet in Turing. Aggressive pre-allocation hasn't been done yet, mostly because it wasn't the biggest perf improvement we could work on, and because Turing has quite a few moving parts, e.g. AdvancedHMC. I think this also requires a bottom-up approach: models defined by users should make use of pre-allocations and immutables where possible, sampling algorithms should enable full pre-allocation, and then the @model API and parameter book-keeping can also do the same.
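As a tiny generic illustration of the "immutables where possible" point (plain Julia with StaticArrays, not Turing-specific; the step functions are made up):

using StaticArrays, BenchmarkTools

step_immutable(x::SVector{3,Float64}) = x .+ 1.0  # new SVector, stack-allocated
step_mutable(x::Vector{Float64}) = x .+ 1.0       # fresh heap Vector every call

@btime step_immutable($(@SVector ones(3)));  # 0 allocations
@btime step_mutable($(ones(3)));             # 1 allocation per call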
As for other tricks: the two biggest are that it performs source to source autodiff (avoiding the overhead many AD libraries seem to apply), and defining lots of high level adjoints. I see Turing already has a stdlib/distributions.jl, so I think that'd be a natural extension. Or do you already have something like it somewhere else?
Yes, we implement adjoints for various distributions, mostly as workarounds for C libraries called by Distributions.jl. But more aggressive adjoints for high-level functions are also possible to include if it helps performance. We currently support 2 AD backends: ForwardDiff and Tracker. @willtebbutt also recently used Zygote to define adjoints of Tracker.TrackedArray for MvNormal, which was a neat integration of 2 AD backends. For reverse-mode AD, we still use Tracker because it supports mutation through TrackedReal, even if it is not super performant, and it is generally less crash-prone compared to Zygote. That said, a Zygote support PR was also started in https://github.com/TuringLang/Turing.jl/pull/783, and it is very compact since we have a nice abstraction for the various AD backends. More use of Zygote can be made either through Tracker or independently, if we figure out a way to work around mutation.
How easily do you think Turing could be adapted to take advantage of the performance tricks I outlined here?
So there are 2 sides to the performance story in Turing: the model-evaluation/AD side and the parameter book-keeping side. For large models where computing the logpdf is the heaviest part, the book-keeping will be mostly negligible, as AD will take most of the time. However, for small problems, this book-keeping can add significant overhead, and eliminating it can give a significant speedup. One step along this line is discussed in https://github.com/TuringLang/Turing.jl/issues/825. Using PaddedMatrices instead of StaticArrays is an interesting integration of our approaches.
But if we can agree on some strategy, not duplicating our efforts will help us all move forward.
Perhaps we can have a meeting with the other Turing members and Chad to discuss future plans. CC: @yebai, @xukai92, @willtebbutt, @trappmartin, @cpfiffer. Referring to https://discourse.julialang.org/t/mcmc-landscape/25654/21, it seems to me that you are experimenting with a different book-keeping strategy for models which is definitely something you can directly implement in Turing or port it over after it proves successful.
An ideal Turing for me wrt usability and performance has a ~-based API that doesn't restrict user-defined functions and that doesn't change with the choice of the inference algorithm, e.g. particle samplers vs HMC. So the goal is to enable the 3 types of optimizations described above (in user models, in the sampling algorithms, and in the @model API and book-keeping) and not get in the way.
I hope this clarifies things from the Turing side.
CC: @cscherrer
I really like the idea of PPL-interested people joining forces. This way everyone can more easily focus on the parts they are interested in. I think it would be very beneficial if we could work together and have an initial Skype conference.
As a side note: we are extremely fortunate to have Mohamed on the team, as most of the people involved are not necessarily very good at writing efficient Julia code but are more familiar with Bayesian inference and advanced topics in Bayesian statistics. Thus, getting more people who are familiar with efficient implementations in Julia involved with Turing would definitely be a great thing.
@trappmartin thanks for your kind words :)
We are still working on exact benchmarks but the final benchmarks will be different after TuringLang/Turing.jl#826 and TuringLang/Turing.jl#793 are merged. Hopefully, these will be ready by JuliaCon. But I can guarantee you we are way more than 10x faster than before!
That is great to hear.
This is a very pleasant surprise to me. I would definitely love to explore the use of other array types in Turing to optimize access pattern.
We can use StaticArrays.MArray for the fields of TypedVarInfo since it won't need to grow in size.
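(For reference, a minimal example of the fixed-size-but-mutable array type in question:)

using StaticArrays

# MArray/MVector are fixed-size but mutable: in-place writes work,
# but the length is part of the type, so there is no push!/resize!.
vals = MVector{4,Float64}(undef)
vals[1] = 0.5       # fine
# push!(vals, 1.0)  # errors: the size cannot grow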
I've done a whole lot of performance work with PaddedMatrices.jl. Even a very basic example, like a simple @inbounds @fastmath for loop (to minimize mystery), shows the difference:
julia> using PaddedMatrices, StaticArrays, BenchmarkTools
julia> pa = @btime @Mutable randn(32,32);
1.588 μs (1 allocation: 8.06 KiB)
julia> ma = @btime @MMatrix randn(32,32);
4.729 μs (1 allocation: 8.06 KiB)
They are both mutable structs wrapping an NTuple. They are fully compatible in terms of memory layout, so we can even just
julia> pa.data = ma.data;
julia> [pa[i,j] == ma[i,j] for i in 1:32, j in 1:32 ] # broadcasting isn't reliable at the moment...
32×32 Array{Bool,2}:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Yet there is a massive difference in performance, even for such a simple function:
julia> function my_sum(x)
           s = zero(eltype(x))
           @inbounds @fastmath for i in 1:length(x)
               s += x[i]
           end
           s
       end
my_sum (generic function with 1 method)
julia> @btime my_sum($ma)
876.833 ns (0 allocations: 0 bytes)
21.33859997410549
julia> @btime my_sum($pa)
31.660 ns (0 allocations: 0 bytes)
21.33859997410542
As implied by the times, we get very different assembly (MMatrices are not vectorized by the compiler, but PaddedMatrices are):
julia> @code_native my_sum(ma)
.text
; ┌ @ REPL[31]:2 within `my_sum'
vxorpd %xmm0, %xmm0, %xmm0
xorl %eax, %eax
nopw %cs:(%rax,%rax)
; │ @ REPL[31]:4 within `my_sum'
; │┌ @ fastmath.jl:161 within `add_fast'
L16:
vaddsd (%rdi,%rax,8), %xmm0, %xmm0
; │└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
addq $1, %rax
cmpq $1024, %rax # imm = 0x400
; │└└
jne L16
; │ @ REPL[31]:6 within `my_sum'
retq
nopw %cs:(%rax,%rax)
nopl (%rax)
; └
julia> @code_native my_sum(pa)
.text
; ┌ @ REPL[31]:2 within `my_sum'
vxorpd %xmm0, %xmm0, %xmm0
xorl %eax, %eax
vxorpd %xmm1, %xmm1, %xmm1
vxorpd %xmm2, %xmm2, %xmm2
vxorpd %xmm3, %xmm3, %xmm3
nopw %cs:(%rax,%rax)
nopl (%rax)
; │ @ REPL[31]:4 within `my_sum'
; │┌ @ fastmath.jl:161 within `add_fast'
L32:
vaddpd (%rdi,%rax,8), %zmm0, %zmm0
vaddpd 64(%rdi,%rax,8), %zmm1, %zmm1
vaddpd 128(%rdi,%rax,8), %zmm2, %zmm2
vaddpd 192(%rdi,%rax,8), %zmm3, %zmm3
addq $32, %rax
cmpq $1024, %rax # imm = 0x400
jne L32
vaddpd %zmm0, %zmm1, %zmm0
vaddpd %zmm0, %zmm2, %zmm0
vaddpd %zmm0, %zmm3, %zmm0
vextractf64x4 $1, %zmm0, %ymm1
vaddpd %zmm1, %zmm0, %zmm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %zmm1, %zmm0, %zmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddpd %xmm1, %xmm0, %xmm0
; │└
; │ @ REPL[31]:6 within `my_sum'
vzeroupper
retq
nopw %cs:(%rax,%rax)
nop
; └
Summing the MMatrix is not vectorized. The compiler simply loops over vaddsd (vector add, single double), instead of looping over 4 vaddpd instructions (vector add, packed doubles); the 4 separate accumulators take advantage of out-of-order execution, breaking up the dependency chain to hide instruction latency.
(The for loops are the parts that say:
L[some number]:
    a bunch of code
    cmpq               # compare two integers
    jne L[that number] # jump back to location [that number] if the comparison was not equal
)
Note that the only functions getting called on the arrays are length and getindex! MMatrices simply don't get along with Julia's autovectorizer at all.
You'll get basically the same performance and assembly with a regular matrix and for i in 1:1024:
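(my_sum1024 isn't shown in the original session; its definition, implied by the comment below, would be:)

julia> function my_sum1024(x)
           s = zero(eltype(x))
           @inbounds @fastmath for i in 1:1024
               s += x[i]
           end
           s
       end
my_sum1024 (generic function with 1 method)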
julia> aa = Array(pa);
julia> @btime my_sum1024($aa) # replaced `for i in 1:length(x)` with `for i in 1:1024`
31.105 ns (0 allocations: 0 bytes)
21.33859997410542
julia> @code_native my_sum1024(aa)
.text
; ┌ @ REPL[37]:2 within `my_sum1024'
movq (%rdi), %rax
vxorpd %xmm0, %xmm0, %xmm0
xorl %ecx, %ecx
vxorpd %xmm1, %xmm1, %xmm1
vxorpd %xmm2, %xmm2, %xmm2
vxorpd %xmm3, %xmm3, %xmm3
nopw %cs:(%rax,%rax)
nop
; │ @ REPL[37]:4 within `my_sum1024'
; │┌ @ fastmath.jl:161 within `add_fast'
L32:
vaddpd (%rax,%rcx,8), %zmm0, %zmm0
vaddpd 64(%rax,%rcx,8), %zmm1, %zmm1
vaddpd 128(%rax,%rcx,8), %zmm2, %zmm2
vaddpd 192(%rax,%rcx,8), %zmm3, %zmm3
addq $32, %rcx
cmpq $1024, %rcx # imm = 0x400
jne L32
vaddpd %zmm0, %zmm1, %zmm0
vaddpd %zmm0, %zmm2, %zmm0
vaddpd %zmm0, %zmm3, %zmm0
vextractf64x4 $1, %zmm0, %ymm1
vaddpd %zmm1, %zmm0, %zmm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %zmm1, %zmm0, %zmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddpd %xmm1, %xmm0, %xmm0
; │└
; │ @ REPL[37]:6 within `my_sum1024'
vzeroupper
retq
nopw %cs:(%rax,%rax)
nop
; └
So this at least is nothing special about PaddedMatrices; it (like most Julia objects) simply is not getting in the way of Julia's autovectorizer, while StaticArrays.MArrays do.
With the 2 optimizations in place and a whole bunch of inlining, it may be possible for the Julia compiler to avoid allocating any memory on the heap for TypedVarInfo! The Julia compiler is good at eliding allocations when mutable objects only live within a single function body. This means that if we make sure all the functions which use vi::TypedVarInfo are properly inlined, the compiler may be able to elide these allocations entirely. Unfortunately, this is not true for Base.Array, since these are implemented in C and so have special semantics.
I've heard that a bunch of inlining isn't necessarily optimal for performance, and I intend to move away from that route once this issue is resolved.
To be able to avoid inlining everything, I'm relying on having ProbabilityModels.jl manually manage all the memory needed with its own stack. The "preallocation" comes with a malloc call when the library is loaded. Currently, the malloc is 1 GiB, but I'm thinking of increasing it. It seems so far that the OS will only eat up as much memory as you've actually touched, so picking big numbers seems safe / does not actually eat system resources.
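In code, the load-time preallocation amounts to something like this sketch (Libc.malloc is Julia's standard wrapper; in a package the call would live in __init__ so it runs when the library is loaded):

const STACK_BASE = convert(Ptr{Float64}, Libc.malloc(1 << 30))  # reserve 1 GiB
# Pages are committed lazily by the OS, so the untouched remainder
# of the gigabyte costs no physical memory.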
Regarding VarInfo itself, is there some succinct summary of what it does / what bookkeeping you have to do?
I think my equivalent is that defining a model named Foo defines a struct with name Foo, and a field for every variable in the model that doesn't look like a function and isn't on the left-hand side of an equals sign.
Those variables are considered "unknowns": they must either be data/priors or parameters.
The struct is parameterized by the type of every field.
One can then create an instance of the struct, using keyword arguments to define each of the unknowns as either particular objects (eg, y = 3), or as a type (σ = PositiveFloat) of known size (ie, PositiveFloat is of length 1, RealVector{23} is of length 23, LKJCorrCholesky{9} is of length (9*8)>>1).
Fields given non-type arguments are now considered known data/priors.
Fields given types are assumed to be unknown parameters, with type equal to that unknown type.
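A hedged sketch of that pattern in plain Julia (Foo, PositiveFloat, and isparam are illustrative stand-ins, not the actual ProbabilityModels API):

struct PositiveFloat end   # stand-in for a length-1 constrained parameter type

struct Foo{Ty,Tσ}          # one parameterized field per unknown
    y::Ty
    σ::Tσ
end
# Keyword constructor; storing Type{T} keeps a type-valued argument's identity
# in Foo's type parameters, so dispatch can distinguish data from parameters.
Foo(; y, σ) = Foo{y isa Type ? Type{y} : typeof(y),
                  σ isa Type ? Type{σ} : typeof(σ)}(y, σ)

isparam(::Type) = true   # a field given a type is an unknown parameter
isparam(::Any)  = false  # a field given a value is known data/priors

m = Foo(y = 3.0, σ = PositiveFloat)
isparam(m.y), isparam(m.σ)   # (false, true)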
Given a vector with length equal to the total length of all parameters, the generated logdensity/logdensity + gradient functions simply load all the parameters from the vector (adding required Jacobians and their gradients), assigning them to variables with the parameter's name. That is, it would add something equivalent to
mangled_name = @inbounds mangled_param_vector_name[mangled_some_index]
σ = Base.exp(mangled_name)
target += mangled_name # log jacobian
(mangling is just referring to ensuring hygiene; target is the same as in Stan)
Julia's compiler obviously has no problem figuring out types from code like this, or optimizing it.
σ is now defined inside the function body, so any references to σ in the model find it in the local scope and use it. It is a boring old variable, like any other, without the need for bookkeeping.
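Put together, the generated log-density for a toy model with a positive scalar σ and a single observation y ~ Normal(0, σ) might look roughly like this (a hand-written sketch with the mangling dropped, not actual generated output):

function logdensity_sketch(θ::AbstractVector{Float64}, y::Float64)
    target = 0.0
    raw_σ = @inbounds θ[1]          # load the parameter from the vector
    σ = exp(raw_σ)                  # transform ℝ → ℝ₊
    target += raw_σ                 # log Jacobian of the exp transform
    target += -log(σ) - (y/σ)^2/2   # Normal(0, σ) log-density, up to a constant
    return target
end

logdensity_sketch([0.3], 1.2)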
The order someone accesses the loaded parameters down the line doesn't matter (for validity), but because it does matter for performance (ie, keep cache hot, or even let things be stored in registers instead of the stack when possible), an eventual optimization might be to better organize the generated function, placing loads closer to first references.
That the mental model of what is going on in the generated functions is so familiar -- ie, exactly what goes on in code we write directly -- is why I keep gravitating to them. No worries about "will the compiler be able to elide this, can it be coerced into optimizing that", etc. Just make the generated function generate the code you would want to write.
It would be an overstatement to say ProbabilityModels only supports static models at the moment, in that saying this would imply it works reliably enough to say it supports anything.
Once the library is more mature, it will remain static-only for quite some time; if/else statements and loops will take a while. I'm going to have to figure out what to do with code like this (I Meta.lower users' models as one of the first processing steps):
julia> Meta.@lower for i ∈ 1:5
target += p[i]
end
:($(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ %1 = 1:5
│ #s9 = Base.iterate(%1)
│ %3 = #s9 === nothing
│ %4 = Base.not_int(%3)
└── goto #4 if not %4
2 ┄ %6 = #s9
│ i = Core.getfield(%6, 1)
│ %8 = Core.getfield(%6, 2)
│ └
│ @ REPL[20]:2 within `top-level scope`
│ %9 = target
│ %10 = Base.getindex(p, i)
│ target = %9 + %10
│ #s9 = Base.iterate(%1, %8)
│ %13 = #s9 === nothing
│ %14 = Base.not_int(%13)
└── goto #4 if not %14
3 ─ goto #2
4 ┄ return
))))
But I plan on letting someone hide dynamic pieces in another function, and just ForwardDiff (or Zygote, etc.) it automatically if they don't provide a gradient.
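For instance, the fallback could look something like this sketch (dynamic_part and dynamic_part_grad are made-up names; assumes ForwardDiff is installed):

using ForwardDiff

# A dynamic, loop-heavy piece that the source-to-source pass can't handle:
dynamic_part(p) = sum(p[i]^2 / i for i in 1:length(p))

# If the user provides no hand-written gradient, fall back to AD:
dynamic_part_grad(p) = ForwardDiff.gradient(dynamic_part, p)

dynamic_part_grad([1.0, 2.0, 3.0])   # [2.0, 2.0, 2.0]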
Regarding PPL people joining forces, how about also looping in the Gen.jl team? @marcoct
They are doing some cool stuff, but taking a less conventional approach, so there may or may not be enough intersection here; if there is, though, it would be very powerful (and simplifying for end users).
Who in the Julia∩PPL community will be at JuliaCon? I'll be there for the week; should be a great time to sync.
I think @cpfiffer and @willtebbutt are attending. Maybe @xukai92 too.
There's also going to be a Gen.jl talk.
@chriselrod thanks for the elaborate response. Sorry it took me a while to get back to you. I am officially on a holiday!
I am definitely sold on PaddedMatrices.jl. I will try to incorporate it in Turing's VarInfo somehow and see how much speedup I can get.
Regarding VarInfo itself is there some succinct summary about what it does / what bookkeeping you have to do?
The main complexity of VarInfo comes from it supporting many types of models and inference algorithms. For example, Turing supports dynamic models where the number of parameters can itself be random. Stochastic flow control is also supported, so variables can be sampled in different orders. This requires, in the most general case, some sort of identifier for each variable and a dictionary that maps it to its values. The way we do this in Turing is using a NamedTuple mapping from parameter symbols to their metadata. Each symbol's value is a Julia scalar, array or array of arrays, and its size can differ between iterations. A single symbol can also cover multiple parameters, e.g. x[i] ~ Normal(i, 1.0) in a loop. The most basic form of metadata is the value of the parameters. Additionally, some inference algorithms, e.g. particle samplers, require extra information, such as the relative order of parameter sampling wrt the observations. For this, we keep additional information in the metadata of each symbol.
We also store the distribution of each random variable to enable the transformation of parameters from the constrained parameter space to Euclidean space and back for Hamiltonian sampling. The distribution also helps us reconstruct the parameter value in its true form from the linearized values, since we store the values in a single vector per symbol. Because of its breadth, VarInfo is a quite complicated data structure. However, one future plan is to make a specialized VarInfo for static models and enable different types of optimizations under more restricted assumptions.
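A much-simplified sketch of the shape of this bookkeeping (illustrative field names only; the real Turing Metadata has considerably more machinery):

struct MetadataSketch{Tvals,Tdists}
    vals::Tvals          # linearized values for this symbol, one flat vector
    dists::Tdists        # distribution of each variable, used to transform
                         # between constrained and unconstrained space
    orders::Vector{Int}  # sampling order relative to the observations,
                         # needed by e.g. particle samplers
end

# A NamedTuple from parameter symbols to their metadata, as described above
# (placeholder symbols stand in for actual Distributions.jl objects):
vi = (x = MetadataSketch(randn(3), fill(:Normal, 3), [1, 2, 3]),
      σ = MetadataSketch([0.5], [:InverseGamma], [4]))

vi.x.vals   # the flat value vector for symbol x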
Your pre-allocation approach is interesting and somewhat non-traditional as far as Julia packages are concerned IIUC. Ideally, I would like to have some form of pre-allocation but without having to manage the pointers explicitly as that may increase the complexity of the code. I would love to read ProbabilityModels.jl to understand your approach better. It may not be that hard after all!
Thanks for your great package once again, and good luck!