XanaduAI / thewalrus

A library for the calculation of hafnians, Hermite polynomials and Gaussian boson sampling.
https://the-walrus.readthedocs.io
Apache License 2.0

Recursive and much faster Torontonian computation added #321

Closed. GregoryMorse closed this 2 years ago.

GregoryMorse commented 2 years ago

Context:

The current Torontonian computation is less efficient than the state of the art.

Description of the Change:

The recursive Torontonian is implemented, with numba used to JIT-compile the code.

Based on the paper "Polynomial speedup in Torontonian calculation by a scalable recursive algorithm" by Ágoston Kaposi, Zoltán Kolarovszki, Tamás Kozsik, Zoltán Zimborás, and Péter Rakyta: https://arxiv.org/pdf/2109.04528.pdf

Benefits:

The recursive computation improves the scaling from roughly n^2.7355 * 2^n to n^1.0695 * 2^n, a significant improvement in the polynomial part of the computation.

Possible Drawbacks:

None, beyond making the library slightly more complicated and introducing a new default that selects the faster computation method.
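
A usage sketch of the proposed default (an assumption for illustration: the switch is exposed as a `recursive` keyword on `tor` that defaults to the new method; the exact name may differ):

    import numpy as np
    from thewalrus import tor

    O = 0.5 * np.eye(4)                  # simple 2-mode matrix with det(I - O_Z) > 0
    t_new = tor(O)                       # assumed default: the recursive method
    t_old = tor(O, recursive=False)      # assumed fallback to the original algorithm
    assert np.allclose(t_new, t_old)     # both methods agree on the value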

Related GitHub Issues:

nquesada commented 2 years ago

Hi @GregoryMorse : This is a great addition! Wondering if you could add some plots benchmarking the non-recursive and the recursive versions?

GregoryMorse commented 2 years ago

> Hi @GregoryMorse : This is a great addition! Wondering if you could add some plots benchmarking the non-recursive and the recursive versions?

Sure, are you referring to ones for documentation purposes e.g. https://github.com/XanaduAI/thewalrus/blob/master/docs/gallery/perm_benchmark.svg? Or any other pre-existing example you would like this to be based upon?

nquesada commented 2 years ago

Hi @GregoryMorse : yes, something analogous to that figure, but for both the new recursive Torontonian and the old non-recursive Torontonian; that way it is easier to judge from which size onward the recursive formula is worthwhile.

codecov[bot] commented 2 years ago

Codecov Report

Merging #321 (12cf4df) into master (dde8320) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #321   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           22        22           
  Lines         1572      1575    +3     
=========================================
+ Hits          1572      1575    +3     
Impacted Files               Coverage Δ
thewalrus/__init__.py        100.00% <ø> (ø)
thewalrus/_torontonian.py    100.00% <100.00%> (ø)


nquesada commented 2 years ago

@GregoryMorse : in the process of adding # pragma no cover we made black unhappy. Feel free to pass it once more on the files and then your PR will pass all the checks.
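
For reference, passing black here amounts to something like the following (file list taken from the Codecov report above):

    black thewalrus/__init__.py thewalrus/_torontonian.py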

GregoryMorse commented 2 years ago

> Hi @GregoryMorse : yes, something analogous to that figure, but for both the new recursive Torontonian and the old non-recursive Torontonian; that way it is easier to judge from which size onward the recursive formula is worthwhile.

Might as well make a benchmarking tutorial that shows how to generate the picture e.g. copying from https://github.com/XanaduAI/thewalrus/blob/master/docs/gallery/hafnian_tutorial.ipynb ?
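
A minimal sketch of what the timing loop in such a tutorial could look like (assuming the `recursive` keyword on `tor`; the warm-up calls keep numba JIT compilation out of the measured times):

    import time
    import numpy as np
    from thewalrus import tor

    # warm-up calls so numba JIT compilation is excluded from the timings
    tor(0.5 * np.eye(2), recursive=True)
    tor(0.5 * np.eye(2), recursive=False)

    for n in range(1, 8):
        O = 0.5 * np.eye(2 * n)          # well-conditioned test matrix, det(I - O_Z) > 0
        t0 = time.perf_counter()
        tor(O, recursive=True)
        t1 = time.perf_counter()
        tor(O, recursive=False)
        t2 = time.perf_counter()
        print(f"n={n}: recursive {t1 - t0:.5f}s, non-recursive {t2 - t1:.5f}s")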

GregoryMorse commented 2 years ago

I see the Torontonian test is seemingly not a test at all: tor is compared with numba_tor, but tor also calls numba_tor, so the comparison is redundant. Either a naive, slow one-liner computation should be used as the reference, or we can rely on a single test function that compares the recursive to the non-recursive method. The latter is effectively happening now that the default is changed to recursive, which makes the old test useful and the new test redundant.
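
For reference, a sketch of such a naive reference check, computing the Torontonian directly from its defining sum over mode subsets (naive_tor is a hypothetical name; mode i is assumed to occupy rows/columns i and i+n, matching the index-building snippets later in this thread):

    from itertools import combinations
    import numpy as np

    def naive_tor(O):
        """Brute-force Torontonian: sum over all subsets Z of the n modes of
        (-1)**(n - |Z|) / sqrt(det(I - O_Z))."""
        n = O.shape[0] // 2
        total = 0j
        for k in range(n + 1):
            for subset in combinations(range(n), k):
                idx = list(subset) + [i + n for i in subset]  # both quadratures of each chosen mode
                O_Z = O[np.ix_(idx, idx)].astype(complex)
                total += (-1) ** (n - k) / np.sqrt(np.linalg.det(np.eye(2 * k) - O_Z))
        return total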

josh146 commented 2 years ago

This is a nice improvement @GregoryMorse!

[benchmark plot: recursive vs. non-recursive Torontonian runtimes]

Regarding this benchmark, it makes me wonder if we should go with the non-recursive algorithm for n=2, and the recursive algorithm otherwise, as the difference is quite significant? I'm just thinking that if there is an application which involves thousands of loops over 2x2 Torontonians, this could lead to a significant difference.

Then again, it feels a bit weird to change the default depending on the matrix size 🤔

nquesada commented 2 years ago

@josh146 : one possible way is to hard-code the case of n=2 for all methods, just like we do for the hafnian wrapper function. One can easily write down the exact answer for a 2x2 matrix.
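
For the single-mode (2x2) case the defining subset sum collapses to just two terms, giving tor(O) = 1/sqrt(det(I - O)) - 1; a minimal sketch of the hard-coded branch (tor_2x2 is a hypothetical name):

    import numpy as np

    def tor_2x2(O):
        # n = 1: the subset sum has only the empty set (contributing -1)
        # and the full set (contributing 1/sqrt(det(I - O))).
        return 1 / np.sqrt(np.linalg.det(np.eye(2) - np.asarray(O, dtype=complex))) - 1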

josh146 commented 2 years ago

Good idea!

GregoryMorse commented 2 years ago

@josh146 @nquesada After really considering the n=2 case, it looks like nothing special is going on there: no recursion occurs, so it should already be as efficient as needed. More than likely the numba JIT compilation time was included in the measurement, so the performance must be remeasured.
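
A self-contained sketch of the warm-up pattern that keeps JIT compilation out of the timing (trace_sum is just a stand-in for a jitted routine such as numba_tor):

    import time
    import numba
    import numpy as np

    @numba.njit
    def trace_sum(A):
        # stand-in for any numba-jitted routine
        return np.trace(A)

    A = np.eye(4)
    trace_sum(A)                      # first call triggers numba JIT compilation
    t0 = time.perf_counter()
    trace_sum(A)                      # only this call is timed
    elapsed = time.perf_counter() - t0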

Also, the code has some slightly non-vectorized aspects; micro-optimizations might give more than a small constant-factor performance boost.

Mainly, these two lines of code are pretty silly and should probably use the direct numpy approach:

  Z = np.zeros(2*n, dtype=np.int_)
  for i in range(n): Z[2*i] = i; Z[2*i+1] = i+n

AFAIK the most efficient way of doing this is:

  Z = np.empty((2*n,), dtype=np.int_)
  Z[0::2] = np.arange(0, n)    # even slots: mode indices 0 .. n-1
  Z[1::2] = np.arange(n, 2*n)  # odd slots: conjugate indices n .. 2n-1
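
For completeness, a quick check that the vectorized construction matches the loop version:

    import numpy as np

    n = 4
    Z_loop = np.zeros(2 * n, dtype=np.int_)
    for i in range(n):
        Z_loop[2 * i] = i
        Z_loop[2 * i + 1] = i + n
    Z_vec = np.empty((2 * n,), dtype=np.int_)
    Z_vec[0::2] = np.arange(0, n)
    Z_vec[1::2] = np.arange(n, 2 * n)
    assert np.array_equal(Z_loop, Z_vec)  # both give [0, n, 1, n+1, ..., n-1, 2n-1]
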
GregoryMorse commented 2 years ago

The n=2 case is solved. It was just a lack of care with the numba JIT; I should have suspected this, sorry for generating the extra discussion. So we are equal at n=2 and better for n>=4. I have updated the gallery pictures. I notice the system-specs cell at the end of the Jupyter notebook does not work for me on Windows, and !cat /proc/cpuinfo is not a portable solution. Is there a nice portable solution? My solution for now is:

import os
# Jupyter shell escapes (!) run the line in the system shell
if os.name == "nt":
    !wmic cpu get caption, deviceid, name, numberofcores, maxclockspeed, status
else:
    !cat /proc/cpuinfo | head -19
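
A portable, if less detailed, standard-library alternative (a sketch):

    import platform
    # available on Windows, Linux, and macOS, though less detailed than /proc/cpuinfo
    print(platform.processor(), platform.machine(), platform.system(), sep=" | ")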

I don't know any numpy tricks to eliminate the for loops in the quad Cholesky, but that would be the only place left to improve. With numba, though, I think this is not particularly important, as any vectorization should already be near-optimal. I have been in touch with the original author of the paper, who fixed a couple of minor mistakes and shared an updated version whose algorithm the PR code already agrees with (and the test results are consistent regardless).

GregoryMorse commented 2 years ago

> @GregoryMorse : in the process of adding # pragma no cover we made black unhappy. Feel free to pass it once more on the files and then your PR will pass all the checks.

You mean we should remove the pragmas and pass it? I will give it a go.

nquesada commented 2 years ago

Hi @GregoryMorse : I just mean that you should run black on all your files.

GregoryMorse commented 2 years ago

@nquesada Thanks for your review.

> 2. ... why there are two methods and they produce different values

Just to clarify: different execution-time values (not different output values).

Also, should the link point to the documentation or to the source code, and should it be on GitHub or on The Walrus webpage?

nquesada commented 2 years ago

@GregoryMorse : I meant different times, not different values! I think you should link to the paper on the arXiv: https://arxiv.org/pdf/2109.04528.pdf

GregoryMorse commented 2 years ago

> Great work @GregoryMorse . Left some language comments. Also have two suggestions:
>
>   1. Make sure you commit a version that you ran in a fresh kernel; that way the cells in the notebook will start at [1] instead of [222].
>   2. You might want to add a cell pointing to the original implementation of the Torontonian and linking to the new recursive method. I don't think you need to explain them, since that can get too technical, but you can certainly add a link so that the interested reader knows where to look and understands why there are two methods and they produce different values.
>
> Overall, a great addition to The Walrus!

@nquesada I've done my best to address all your concerns.