Help: measuring performance

Strilanc commented 7 years ago

I'm having trouble measuring ProjectQ's performance. Something is causing a serious slowdown.

For example, I ran these commands from my terminal:

mkvirtualenv tmp
pip install pybind11
pip install projectq
python speed_test.py

Here are the contents of speed_test.py:

from __future__ import print_function

import time

from projectq.backends import Simulator
from projectq.cengines import MainEngine
from projectq.ops import X, H, Toffoli

def main():
    n = 256

    sim = Simulator()
    eng = MainEngine(backend=sim, engine_list=[])
    qubits = eng.allocate_qureg(3)

    for qubit_count in range(4, 20):
        qubits.append(eng.allocate_qubit())
        t = time.time()
        m = len(qubits)
        for i in range(n):
            a, b, c = qubits[i % m], qubits[(i+1) % m], qubits[(i+2) % m]
            Toffoli | (a, b, c)
            X | a
            H | a
        dt = time.time() - t
        print("{} gates/sec @ {} qubits".format(int(3*n/dt), len(qubits)))

if __name__ == "__main__":
    main()

And got these results:

199 gates/sec @ 4 qubits
192 gates/sec @ 5 qubits
190 gates/sec @ 6 qubits
194 gates/sec @ 7 qubits
183 gates/sec @ 8 qubits
199 gates/sec @ 9 qubits
198 gates/sec @ 10 qubits
200 gates/sec @ 11 qubits
206 gates/sec @ 12 qubits
193 gates/sec @ 13 qubits
195 gates/sec @ 14 qubits
183 gates/sec @ 15 qubits
195 gates/sec @ 16 qubits
190 gates/sec @ 17 qubits
196 gates/sec @ 18 qubits
184 gates/sec @ 19 qubits
Exception RuntimeError: 'Error: Qubit has not been measured / uncomputed! [...]
[...]

Those rates are terrible. I get higher performance with the Python simulator up to 13 qubits:

(Note: This is the (slow) Python simulator.)
18598 gates/sec @ 4 qubits
16701 gates/sec @ 5 qubits
13797 gates/sec @ 6 qubits
9966 gates/sec @ 7 qubits
6365 gates/sec @ 8 qubits
3621 gates/sec @ 9 qubits
1983 gates/sec @ 10 qubits
1038 gates/sec @ 11 qubits
532 gates/sec @ 12 qubits
264 gates/sec @ 13 qubits
135 gates/sec @ 14 qubits
68 gates/sec @ 15 qubits
[...]

Last month when I speed-tested ProjectQ, it was getting numbers similar to Quirk: 8000 gates/sec at 16 qubits. I'm not sure what has changed in the meantime, but performance seems to have dropped by roughly 50x.

I have confirmed in my own debugging that the line self._simulator.apply_controlled_gate seems to be the big offender, but I haven't figured out much more than that.
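
For anyone trying to narrow down a hot spot like this, here is a minimal sketch using the standard-library profiler (Python 3). The names hot_spots and stand_in_workload are hypothetical placeholders, not part of ProjectQ; in practice the workload would be main() from speed_test.py. Time spent inside a compiled extension is attributed to the Python-level call that enters it.

```python
import cProfile
import io
import pstats


def hot_spots(func, top=5):
    """Profile func() and return a text report of the costliest calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the outermost expensive call comes first.
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


def stand_in_workload():
    # Hypothetical placeholder for main() in speed_test.py.
    return sum(i * i for i in range(100000))


if __name__ == "__main__":
    print(hot_spots(stand_in_workload))
```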

thomashaener commented 7 years ago

That's weird. Did you do export OMP_NUM_THREADS=#cores? Depending on the compiler (icc vs gcc, also version-dependent), threads are not kept alive and that causes a large slowdown.

damiansteiger commented 7 years ago

This is a short speed test on my notebook (on battery power). With my compiler it doesn't matter much whether I use OMP_NUM_THREADS=4 or OMP_NUM_THREADS=8:

Damians-MacBook-Pro:code Damian$ export OMP_NUM_THREADS=4
Damians-MacBook-Pro:code Damian$ python2.7 speed_test.py
13742 gates/sec @ 4 qubits
15134 gates/sec @ 5 qubits
15053 gates/sec @ 6 qubits
14618 gates/sec @ 7 qubits
14910 gates/sec @ 8 qubits
15114 gates/sec @ 9 qubits
15018 gates/sec @ 10 qubits
14857 gates/sec @ 11 qubits
14525 gates/sec @ 12 qubits
14234 gates/sec @ 13 qubits
13361 gates/sec @ 14 qubits
11900 gates/sec @ 15 qubits
10137 gates/sec @ 16 qubits
7969 gates/sec @ 17 qubits
5181 gates/sec @ 18 qubits
3055 gates/sec @ 19 qubits

Strilanc commented 7 years ago

export OMP_NUM_THREADS=4 makes a huge difference:

21677 gates/sec @ 4 qubits
21287 gates/sec @ 5 qubits
22056 gates/sec @ 6 qubits
19170 gates/sec @ 7 qubits
13135 gates/sec @ 8 qubits
18900 gates/sec @ 9 qubits
21962 gates/sec @ 10 qubits
21605 gates/sec @ 11 qubits
20964 gates/sec @ 12 qubits
20316 gates/sec @ 13 qubits
18732 gates/sec @ 14 qubits
16289 gates/sec @ 15 qubits
13201 gates/sec @ 16 qubits
6460 gates/sec @ 17 qubits
4014 gates/sec @ 18 qubits
2927 gates/sec @ 19 qubits

Given the huge difference in performance, is there a reason this isn't the default?

thomashaener commented 7 years ago

The OpenMP default is the number of available hardware threads; I don't know why this is the case.

Strilanc commented 7 years ago

If it's possible to detect this kind of misconfiguration and fix it, we might want to consider doing that. But the export workaround does solve my particular issue with testing performance.
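
A rough sketch of what such a check could look like on the Python side. It assumes the variable has to be set before the OpenMP runtime is loaded, so it would need to run before the compiled simulator is imported. Whether this avoids the slowdown on a given compiler is untested. The helper name ensure_omp_num_threads is hypothetical and not part of ProjectQ, and os.cpu_count() reports logical CPUs, which may differ from the physical core count.

```python
import os


def ensure_omp_num_threads(environ=None, cpu_count=None):
    """Set OMP_NUM_THREADS if it is unset.

    Returns (value, changed): the value now in effect, and whether
    this call had to set it.
    """
    environ = os.environ if environ is None else environ
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    value = environ.get("OMP_NUM_THREADS")
    if value is not None:
        # Respect an explicit user setting.
        return value, False
    value = str(cpu_count)
    environ["OMP_NUM_THREADS"] = value
    return value, True


if __name__ == "__main__":
    # Call this before importing the simulator backend.
    print(ensure_omp_num_threads())
```

Leaving an existing value alone keeps the export workaround above working exactly as before.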

damiansteiger commented 7 years ago

You may also want to use export OMP_PROC_BIND=SPREAD to increase the simulator performance even more:

Damians-MacBook-Pro:code Damian$ export OMP_NUM_THREADS=4
Damians-MacBook-Pro:code Damian$ export OMP_PROC_BIND=SPREAD
Damians-MacBook-Pro:code Damian$ python2.7 speed_test.py
15714 gates/sec @ 4 qubits
15804 gates/sec @ 5 qubits
15121 gates/sec @ 6 qubits
15463 gates/sec @ 7 qubits
14696 gates/sec @ 8 qubits
14942 gates/sec @ 9 qubits
15318 gates/sec @ 10 qubits
14667 gates/sec @ 11 qubits
14639 gates/sec @ 12 qubits
13932 gates/sec @ 13 qubits
12428 gates/sec @ 14 qubits
11225 gates/sec @ 15 qubits
10220 gates/sec @ 16 qubits
8007 gates/sec @ 17 qubits
5537 gates/sec @ 18 qubits
3215 gates/sec @ 19 qubits

thomashaener commented 7 years ago

By the way, I think it should be qubits.extend(...) rather than append; not that it makes a difference :)
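
To illustrate the difference with plain lists: this sketch assumes allocate_qubit() returns a one-element register rather than a bare qubit, which is what the comment implies. The function allocate_qubit_stand_in is a hypothetical placeholder so the snippet runs without ProjectQ.

```python
def allocate_qubit_stand_in(index):
    # Hypothetical stand-in: returns a one-element register, the way
    # allocate_qubit() is assumed to.
    return ["q{}".format(index)]


appended = ["q0", "q1", "q2"]
appended.append(allocate_qubit_stand_in(3))  # nests the whole register

extended = ["q0", "q1", "q2"]
extended.extend(allocate_qubit_stand_in(3))  # adds the qubit itself

print(appended)  # ['q0', 'q1', 'q2', ['q3']]
print(extended)  # ['q0', 'q1', 'q2', 'q3']
```

Both lists have four entries, which is why the qubit counts printed by the speed test come out the same either way.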
