artofscience / SAOR

Sequential Approximate Optimization Repository
GNU General Public License v3.0

Performance: no broadcasting; reduced evaluations #40

Closed · MaxvdKolk closed this 3 years ago

MaxvdKolk commented 3 years ago

Some minor changes based on performance profiling of the tests/problems/test_square.py problem, running only the ipxyz variant.

Measurements on 8d982e23666a42530ef44f1f5560cf9a7eee89e7

# n = 20, for 20 iterations
Benchmark #1: python tests/problems/test_square.py
  Time (mean ± σ):     362.2 ms ±   3.6 ms    [User: 457.6 ms, System: 1051.8 ms]
  Range (min … max):   359.3 ms … 370.7 ms    10 runs

# n = 2000, for 20 iterations
Benchmark #1: python tests/problems/test_square.py
  Time (mean ± σ):      1.319 s ±  0.010 s    [User: 1.462 s, System: 1.187 s]
  Range (min … max):    1.302 s …  1.333 s    10 runs

# n = 20000, for 20 iterations
Benchmark #1: python tests/problems/test_square.py
  Time (mean ± σ):      8.163 s ±  0.124 s    [User: 12.904 s, System: 31.740 s]
  Range (min … max):    7.954 s …  8.397 s    10 runs

Measurements on 50b8a5cfe168fe07360f8deb93642cd275ec6d33

# n = 20, for 20 iterations
Benchmark #1: python tests/problems/test_square.py
  Time (mean ± σ):     304.4 ms ±   3.0 ms    [User: 365.4 ms, System: 840.7 ms]
  Range (min … max):   300.5 ms … 310.0 ms    10 runs

# n = 2000, for 20 iterations
Benchmark #1: python tests/problems/test_square.py
  Time (mean ± σ):     967.5 ms ±   9.1 ms    [User: 1.116 s, System: 1.204 s]
  Range (min … max):   951.5 ms … 980.3 ms    10 runs

# n = 20000, for 20 iterations
Benchmark #1: python tests/problems/test_square.py
  Time (mean ± σ):      5.840 s ±  0.034 s    [User: 9.251 s, System: 22.692 s]
  Range (min … max):    5.796 s …  5.908 s    10 runs

The improvement ranges from roughly 15% for n = 20 to nearly 30% for n = 20000, growing with the problem size.

Replacing np.broadcast_to with np.reciprocal within MMA avoids the broadcasting and some indexing by combining the optional where= and out= arguments. The first call now allocates the array and leaves the entries at ~self.positive undefined; the second call then fills those entries in place. We could save more allocations by storing these arrays on the class itself, but I ran into some bugs when trying that, so I have left it for another time.
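
For reference, a minimal sketch of the where=/out= pattern described above; the names positive, x, low, and upp are illustrative here, not the exact MMA attributes:

import numpy as np

n = 5
x = np.linspace(1.0, 2.0, n)
low, upp = x - 0.5, x + 0.5
positive = np.array([True, False, True, False, True])

# First call: out=None, so NumPy allocates the output array and only
# writes the entries where the mask is True; the rest stay undefined.
y = np.reciprocal(upp - x, where=positive)
# Second call: fills the remaining entries in place via out=.
np.reciprocal(x - low, out=y, where=~positive)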

Secondly, I added g_dg and g_dg_ddg to bundle the g, dg, and ddg calls. These calls repeated the evaluation of the approximated problem, and it seems typical that all of them are computed at the same time, so combining them saves some repeated evaluations. The code is not the cleanest, so we might want to think about that structure a bit more.
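
To illustrate the bundling, a hypothetical sketch (the _shared, _g, _dg, _ddg names are stand-ins, not the actual methods):

def g_dg_ddg(self, x):
    # Evaluate the shared intermediate of the approximation once,
    # instead of once per call to g(), dg(), and ddg().
    y = self._shared(x)
    return self._g(y), self._dg(y), self._ddg(y)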

Finally, I also store the arrays returned by the problem, in this case Square. I will play around with this structure a bit more, as it would be great if we could lay out the Problem such that this is done automatically. For large problems it is quite costly to reallocate the design variable and sensitivity vectors every iteration (assuming Python does not do anything smart by reusing the allocations and simply drops the memory and reallocates; if it does do something smarter, this might have little benefit).
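
The idea, sketched on a made-up Square-like problem (not the actual implementation, and the responses here are invented for illustration), is to preallocate the response and sensitivity buffers once and fill them in place on every call:

import numpy as np

class Square:
    def __init__(self, n):
        self.n = n
        self.f = np.zeros(2)        # responses, reused every iteration
        self.df = np.zeros((2, n))  # sensitivities, reused every iteration

    def g(self, x):
        self.f[0] = np.dot(x, x)
        self.f[1] = 1.0 - x.sum() / self.n
        return self.f

    def dg(self, x):
        np.multiply(2.0, x, out=self.df[0])  # writes in place, no new array
        self.df[1] = -1.0 / self.n
        return self.df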

Maybe @artofscience and @aatmdelissen can take a brief look. I will try to clean it up further before actually merging into dev, though.