jaberg closed this 11 years ago
A better toy example where merging would be useful is:

```python
x = tensor.vector('x')
y = tensor.vector('y')
ws = Workspace()
ws[x] = [1, 2]
ws[y] = [3, 4]
ws.compile_update('f', [
    (x, 2 * x),
    (y, 2 * y)])
ws.optimize_memory_layout()
```
In this situation, the Workspace is in a position to merge its storage of x and y into a single vector of length 4, and to compute x and y at the same time with a single elementwise multiplication. It can then fetch the first half of that vector when the user asks for ws[x] and the second half when asked for ws[y].
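To make the merging idea concrete, here is a minimal NumPy sketch (illustrative only, not the Workspace implementation): both vectors live in one contiguous buffer, a single vectorized multiply performs both updates, and slices serve as the per-variable views.

```python
import numpy as np

# Merge the storage of x = [1, 2] and y = [3, 4] into one length-4
# buffer: x occupies buf[0:2], y occupies buf[2:4].
buf = np.array([1.0, 2.0, 3.0, 4.0])

# One elementwise multiplication computes both updates at once.
buf *= 2.0

# What ws[x] and ws[y] would hand back: views into the shared buffer.
x_view = buf[0:2]
y_view = buf[2:4]
```

Because the views alias the buffer, later merged updates are visible through them with no copying.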
So for anyone who's monitoring the developments here: things are not done, but I think the proof of concept is working out. It is possible to merge multiple isomorphic computations together at the theano level, and that does bring greater speed. The tests in test_workspace.py work with the sort of graph shown above, where multiple vectors are updated with twice their value. On one CPU core with 26 groups (one for each letter of the alphabet) of 50 elements each, the speed improves by about 12x, and even with 1000-element groups the speed improves by 4x. I haven't tried a GPU, but presumably the advantage there would be much bigger.
Also, a few changes to theano were necessary, but hopefully those will be merged into the dev branch there soon.
The current optimizations are aimed at regrouping graphs of elemwise, subtensor, and incsubtensor of vectors. They are not very robust; they will need more work to (a) work in all cases where we need them and (b) not apply when they should not.
Currently, the workspace mechanism bypasses the mechanism by which theano swaps between normal and debug mode, so the current tests do not use debug_mode to verify optimization correctness. The nice way to fix this would be to implement the long-standing goal of moving debug_mode functionality into graph optimizations. The quick & dirty way would be to hack the CompiledUpdate class to go through theano.function with mode="DEBUG_MODE".
The optimizations implemented here are certainly not enough to handle all of e.g. test_runtime yet, because (a) the state is stored in matrices there instead of vectors [minor] and (b) I think theano does not include an all-at-once batched_dot-like Op to represent applying all encoders to all inputs. Even if batched_dot is sufficient, its implementation is not parallel at the moment, so there's some engineering to be done there. Once those big obstacles are dealt with, there will of course be other issues too.
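As a sketch of what such a batched_dot-like Op would compute, here is a NumPy version using einsum; the names (encoders, inputs) and shapes are illustrative assumptions, not taken from the codebase.

```python
import numpy as np

# Hypothetical shapes: 26 ensembles, 50 neurons each, 3-D inputs.
n_ensembles, n_neurons, dims = 26, 50, 3
rng = np.random.RandomState(0)
encoders = rng.randn(n_ensembles, n_neurons, dims)
inputs = rng.randn(n_ensembles, dims)

# One einsum call applies every encoder matrix to its corresponding
# input vector, replacing a Python loop of n_ensembles separate dots.
batched = np.einsum('bnd,bd->bn', encoders, inputs)

# The equivalent loop, for comparison:
looped = np.array([encoders[i].dot(inputs[i]) for i in range(n_ensembles)])
```

A single fused Op like this is also the natural unit to parallelize across cores or on a GPU.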
For this project it might actually turn out that multicore CPUs are a good fit for the computations, so it may be very worthwhile to make sure that the batched_dot and elemwise code generators are at least marked up with pragmas for loop parallelization. It might even be worth doing OpenCL implementations of the key ops to ensure a good parallel algorithm is used.
Very cool! If there are areas that could use more testing support, or anything else that someone not fluent in Theano can help with, let me/us know, as this seems really essential for scaling up to really big models.
Want to sketch some more realistic test cases into test_workspace.py?
Will do!
There's some interest on theano-dev in this feature, so I'm going to close this PR. Moving the code to
https://github.com/jaberg/theano_workspace
That project can have its own issue tracker, documentation etc, which is all pretty independent of nef-py.
Test code that corresponds to nef-py use cases would still be great, preferably as a PR to theano_workspace. Down the road theano_workspace may get merged into theano, but for now it will be more wild-west hacking time than theano can handle.
Alrighty, I'll make a PR over there.
So what I'm calling "workspaces" has been on my radar for a long time for Theano, as a way of simplifying the use of shared variables and eliminating hundreds (thousands??) of lines of crap in Theano.
The idea is to not have shared variables, but instead to have a workspace:
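A minimal sketch of what that could look like, assuming the dict-like API from the toy example above; this is purely illustrative (the update pairs here are plain Python callables rather than symbolic Theano expressions), not the real Workspace implementation.

```python
import numpy as np

class Workspace(object):
    """Owns all state centrally; variables are just keys into it,
    rather than shared variables that each manage their own storage."""

    def __init__(self):
        self._vals = {}
        self._updates = {}

    def __setitem__(self, var, value):
        self._vals[var] = np.asarray(value, dtype=float)

    def __getitem__(self, var):
        return self._vals[var]

    def compile_update(self, name, pairs):
        # Register a named list of (variable, update-function) pairs.
        self._updates[name] = pairs

    def run_update(self, name):
        # Apply every registered update for this name in place.
        for var, fn in self._updates[name]:
            self._vals[var] = fn(self._vals[var])

ws = Workspace()
ws['x'] = [1, 2]
ws.compile_update('f', [('x', lambda v: 2 * v)])
ws.run_update('f')
```

Because the workspace owns all the storage, it is free to reorganize memory layout (e.g. the merging described earlier) behind the scenes.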
The reason to use this mechanism is so that
This last point follows up on recent discussion of how to conglomerate the state information across ensembles (and create ensemble views etc.) for the purpose of running models fast. It wouldn't be the same thing as making ensembles all views, but it would allow us to get fast implementations of the current type of ensemble, namely, the kind that appears to manage its own memory.