fancompute / neuroptica

Flexible simulation package for optical neural networks
https://doi.org/10.1109/JSTQE.2019.2930455
MIT License

Vectorized #7

Closed momchilmm closed 5 years ago

momchilmm commented 5 years ago

I've added two new flags to the InSituAdam.fit() method that should speed up the computation in some cases. Both default to False, which reproduces the original formulation.

To test the timing, I tried the following code:

import numpy as np
import neuroptica as neu

N = 500       # mesh size (number of ports)
N_cl = 10     # number of output classes
N_tot = 10    # number of training samples
x_tr = np.random.rand(N, N_tot)
y_tr = np.random.rand(N_cl, N_tot)
model_1layer = neu.Sequential([
    neu.ClementsLayer(N),
    neu.Activation(neu.Abs(N)),
    neu.DropMask(N, keep_ports=range(N_cl))
])

import cProfile, pstats, io
from pstats import SortKey
pr = cProfile.Profile()
pr.enable()

losses = neu.InSituAdam(model_1layer, neu.CategoricalCrossEntropy, step_size=0.005).fit(
    x_tr, y_tr, epochs=2, batch_size=10, field_store=False, partial_vectors=False)

pr.disable()
s = io.StringIO()
sortby = SortKey.CUMULATIVE
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print(s.getvalue())

This results in the following output (top lines only):

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000  221.617  110.809 /home/momchil/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:2931(run_code)
      3/2    0.000    0.000  221.617  110.809 {built-in method builtins.exec}
        1    0.152    0.152  221.617  221.617 <ipython-input-7-012a63dc2924>:17(<module>)
        1    4.861    4.861  220.599  220.599 ../neuroptica/neuroptica/optimizers.py:136(fit)
        2    0.329    0.164  166.365   83.183 ../neuroptica/neuroptica/components/component_layers.py:440(compute_gradients)
     2000    4.855    0.002  161.485    0.081 ../neuroptica/neuroptica/components/component_layers.py:77(get_partial_transfer_matrices)
    13010  142.192    0.011  142.192    0.011 {built-in method numpy.core.multiarray.dot}
        2    0.116    0.058   82.816   41.408 ../neuroptica/neuroptica/components/component_layers.py:280(compute_phase_shifter_fields)
        2    0.124    0.062   82.742   41.371 ../neuroptica/neuroptica/components/component_layers.py:334(compute_adjoint_phase_shifter_fields)
        4    0.001    0.000   45.117   11.279 ../neuroptica/neuroptica/components/component_layers.py:263(get_transfer_matrix)

A huge amount of time is spent computing the MZI partial transfer matrices. Setting partial_vectors=True yields:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000   67.725   33.863 /home/momchil/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:2931(run_code)
        2    0.000    0.000   67.725   33.863 {built-in method builtins.exec}
        1    0.146    0.146   67.725   67.725 <ipython-input-8-40942594a05a>:17(<module>)
        1    4.823    4.823   66.905   66.905 ../neuroptica/neuroptica/optimizers.py:136(fit)
        4    0.001    0.000   43.964   10.991 ../neuroptica/neuroptica/components/component_layers.py:263(get_transfer_matrix)
        4   34.107    8.527   34.107    8.527 {built-in method _functools.reduce}
        2    0.000    0.000   22.727   11.364 ../neuroptica/neuroptica/models.py:45(forward_pass)
        2    0.007    0.003   22.727   11.363 ../neuroptica/neuroptica/layers.py:125(forward_pass)
        2    0.000    0.000   21.260   10.630 ../neuroptica/neuroptica/models.py:54(backward_pass)
        2    0.009    0.005   21.257   10.629 ../neuroptica/neuroptica/layers.py:136(backward_pass)
        2    0.412    0.206   13.861    6.930 ../neuroptica/neuroptica/components/component_layers.py:440(compute_gradients)
     2000    4.565    0.002   12.257    0.006 ../neuroptica/neuroptica/components/component_layers.py:122(get_partial_transfer_vectors)

A significant reduction in time: get_partial_transfer_vectors() takes 12.257 s in total, versus the 161.485 s spent in get_partial_transfer_matrices() in the first case.
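My understanding of why this helps: each MZI couples only two waveguides, so a column of MZIs is a block-diagonal operator with 2x2 blocks. Storing just the four block entries and applying them with elementwise operations is O(N) per field column, instead of the O(N^2) cost of a full N x N matrix product. A minimal sketch of the idea (illustrative only, not neuroptica's actual implementation):

```python
import numpy as np

N = 500          # number of waveguides (ports)
X = np.random.rand(N, 64) + 1j * np.random.rand(N, 64)  # fields, one column per sample

# Hypothetical 2x2 unitaries for N//2 MZIs acting on port pairs (0,1), (2,3), ...
# stored as four length-(N//2) vectors instead of an N x N matrix.
t11, t12, t21, t22 = (np.random.rand(N // 2) + 1j * np.random.rand(N // 2)
                      for _ in range(4))

def apply_blocks_vectorized(X):
    """Apply the block-diagonal layer via elementwise ops on row pairs."""
    Y = np.empty_like(X)
    top, bot = X[0::2], X[1::2]                   # even / odd ports
    Y[0::2] = t11[:, None] * top + t12[:, None] * bot
    Y[1::2] = t21[:, None] * top + t22[:, None] * bot
    return Y

def apply_blocks_dense(X):
    """Equivalent full N x N matrix multiply, for comparison."""
    T = np.zeros((N, N), dtype=complex)
    i = np.arange(0, N, 2)
    T[i, i], T[i, i + 1] = t11, t12
    T[i + 1, i], T[i + 1, i + 1] = t21, t22
    return T @ X

# Both give identical results; only the cost differs.
assert np.allclose(apply_blocks_vectorized(X), apply_blocks_dense(X))
```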

Finally, also setting field_store=True yields:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000   23.236   11.618 /home/momchil/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:2931(run_code)
        2    0.000    0.000   23.236   11.618 {built-in method builtins.exec}
        1    0.139    0.139   23.236   23.236 <ipython-input-9-58fb17ed423e>:17(<module>)
        1    4.781    4.781   22.421   22.421 ../neuroptica/neuroptica/optimizers.py:136(fit)
     2000    4.500    0.002   12.051    0.006 ../neuroptica/neuroptica/components/component_layers.py:122(get_partial_transfer_vectors)

This gives a further large reduction in time, because get_transfer_matrix() is now never called for the MZI components.
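The pattern behind field_store, as I'd describe it, is simply caching the fields computed during the forward pass so the gradient pass can reuse them instead of rebuilding transfer matrices. A toy sketch of that caching pattern (names are illustrative, not neuroptica's API):

```python
import numpy as np

class CachingLayer:
    """Toy layer that optionally stores forward-pass fields so a later
    pass can reuse them instead of redoing the matrix product."""

    def __init__(self, T, field_store=False):
        self.T = T                    # fixed transfer matrix for the demo
        self.field_store = field_store
        self._fields = None

    def forward(self, X):
        Y = self.T @ X
        if self.field_store:
            self._fields = Y          # cache for the backward pass
        return Y

    def backward_fields(self, X):
        if self.field_store and self._fields is not None:
            return self._fields       # no extra matmul needed
        return self.T @ X             # recompute when nothing is cached

rng = np.random.default_rng(0)
T = rng.standard_normal((4, 4))
X = rng.standard_normal((4, 3))

layer = CachingLayer(T, field_store=True)
Y = layer.forward(X)
assert layer.backward_fields(X) is Y  # the cached array is reused as-is
```

The trade-off is memory: the cached fields must be kept alive between the two passes.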

Overall, that is an order-of-magnitude reduction in total time for this particular example (221.6 s down to 23.2 s).

momchilmm commented 5 years ago

I changed all the indents to 4 spaces. Re merging, there is no rush, I'm fine with keeping this on a separate branch (I just updated the nonlinearity file here to match the recent changes on master).

Re using sparse matrices, I actually tried this first. I've now pushed another branch called sparse in which I make the MZI partial matrices sparse. However, I gave up on that approach after finding that, for small N, sparse-matrix creation has a huge overhead. For larger N it beats full matrices, but it still seems slightly slower than the vectorized version. The same timed code as above, but run on the sparse branch (and without field_store, which is not implemented there), yields:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000   88.296   44.148 /home/momchil/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:2931(run_code)
      3/2    0.000    0.000   88.296   44.148 {built-in method builtins.exec}
        1    0.151    0.151   88.296   88.296 <ipython-input-5-012a63dc2924>:17(<module>)
        1    5.164    5.164   87.427   87.427 ../neuroptica/neuroptica/optimizers.py:136(fit)
        4    0.001    0.000   45.356   11.339 ../neuroptica/neuroptica/components/component_layers.py:221(get_transfer_matrix)
        4   34.288    8.572   34.288    8.572 {built-in method _functools.reduce}
        2    0.307    0.154   32.628   16.314 ../neuroptica/neuroptica/components/component_layers.py:363(compute_gradients)
     2000   13.630    0.007   31.387    0.016 ../neuroptica/neuroptica/components/component_layers.py:79(get_partial_transfer_matrices)
        2    0.000    0.000   22.919   11.459 ../neuroptica/neuroptica/models.py:46(forward_pass)
        2    0.004    0.002   22.918   11.459 ../neuroptica/neuroptica/layers.py:121(forward_pass)
        2    0.000    0.000   22.451   11.226 ../neuroptica/neuroptica/models.py:52(backward_pass)
        2    0.003    0.002   22.448   11.224 ../neuroptica/neuroptica/layers.py:126(backward_pass)
        2    0.079    0.039   16.352    8.176 ../neuroptica/neuroptica/components/component_layers.py:278(compute_adjoint_phase_shifter_fields)
        2    0.065    0.032   15.515    7.757 ../neuroptica/neuroptica/components/component_layers.py:239(compute_phase_shifter_fields)
        4    0.004    0.001   11.067    2.767 ../neuroptica/neuroptica/components/component_layers.py:222(<listcomp>)
     2000    1.251    0.001   11.050    0.006 ../neuroptica/neuroptica/components/component_layers.py:67(get_transfer_matrix)
   499000    6.945    0.000    8.311    0.000 ../neuroptica/neuroptica/components/components.py:87(get_transfer_matrix)

Compare this to the 67.725 s total time of the vectorized run. Of course, it is possible that my sparse-matrix implementation is not optimal.
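To illustrate the construction-overhead point, here's a standalone sketch (not the sparse branch's actual code) that builds the same 2x2-block operator both densely and as a scipy.sparse CSR matrix. The products agree; timing the two paths at different N shows where the sparse build cost sits relative to the matmul it saves:

```python
import time
import numpy as np
from scipy import sparse

def block_diagonal_dense(t11, t12, t21, t22):
    """Dense N x N matrix with 2x2 blocks on the diagonal."""
    N = 2 * len(t11)
    T = np.zeros((N, N), dtype=complex)
    i = np.arange(0, N, 2)
    T[i, i], T[i, i + 1] = t11, t12
    T[i + 1, i], T[i + 1, i + 1] = t21, t22
    return T

def block_diagonal_sparse(t11, t12, t21, t22):
    """Same operator in CSR form; the construction itself costs time."""
    N = 2 * len(t11)
    i = np.arange(0, N, 2)
    rows = np.concatenate([i, i, i + 1, i + 1])
    cols = np.concatenate([i, i + 1, i, i + 1])
    data = np.concatenate([t11, t12, t21, t22])
    return sparse.csr_matrix((data, (rows, cols)), shape=(N, N))

for N in (8, 512):
    rng = np.random.default_rng(1)
    t = [rng.standard_normal(N // 2) + 0j for _ in range(4)]
    X = rng.standard_normal((N, 16))

    t0 = time.perf_counter(); Td = block_diagonal_dense(*t);  Yd = Td @ X
    t1 = time.perf_counter(); Ts = block_diagonal_sparse(*t); Ys = np.asarray(Ts @ X)
    t2 = time.perf_counter()

    assert np.allclose(Yd, Ys)  # same operator either way
    print(f"N={N}: dense build+matmul {t1 - t0:.2e} s, sparse {t2 - t1:.2e} s")
```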