jaberg closed this 11 years ago
A better toy example where merging would be useful is:

```python
x = tensor.vector('x')
y = tensor.vector('y')
ws = Workspace()
ws[x] = [1, 2]
ws[y] = [3, 4]
ws.compile_update('f', [
    (x, 2 * x),
    (y, 2 * y)])
ws.optimize_memory_layout()
```
In this situation, the Workspace is in a position to merge its storage of x and y into a single vector of length 4, and to compute x and y at the same time with a single elementwise multiplication. It can then fetch the first half of that vector when the user asks for ws[x] and the second half when asked for ws[y].
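To make the merging idea concrete, here is a minimal NumPy sketch (illustrative only, not the Workspace implementation): both vectors live in one contiguous buffer, a single vectorized multiply performs both updates, and slices serve as the per-variable views.

```python
import numpy as np

# Merge the storage of x = [1, 2] and y = [3, 4] into one length-4
# buffer: x occupies buf[0:2], y occupies buf[2:4].
buf = np.array([1.0, 2.0, 3.0, 4.0])

# One elementwise multiplication computes both updates at once.
buf *= 2.0

# What ws[x] and ws[y] would hand back: views into the shared buffer.
x_view = buf[0:2]
y_view = buf[2:4]
```

Because the views alias the buffer, later merged updates are visible through them with no copying.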
So for anyone who's monitoring the developments here: things are not done, but I think the proof of concept is working out. It is possible to merge multiple isomorphic computations together at the theano level, and that does bring greater speed. The tests in test_workspace.py work with the sort of graph shown above, where multiple vectors are updated with twice their value. On one CPU core with 26 groups (one for each letter of the alphabet) of 50 elements each, the speed improves by about 12x, and even with 1000-element groups the speed improves by 4x. I haven't tried a GPU, but presumably the advantage there would be much bigger.
Also, a few changes to theano were necessary, but hopefully those will be merged into the dev branch there soon.
The current optimizations are aimed at regrouping graphs of elemwise, subtensor, and incsubtensor of vectors. They are not very robust; they will need more work to (a) work in all cases where we need them and (b) not apply when they should not.
Currently, the workspace mechanism bypasses the mechanism by which theano swaps between normal and debug mode, so the current tests do not use debug_mode to verify optimization correctness. The nice way to fix this would be to implement the long-standing goal of moving debug_mode functionality into graph optimizations. The quick & dirty way would be to hack the CompiledUpdate class to go through theano.function with mode="DEBUG_MODE".
The optimizations implemented here are certainly not enough to handle all of e.g. test_runtime yet, because (a) the state is stored in matrices there instead of vectors [minor] and (b) I think theano does not include an all-at-once batched_dot-like Op to represent applying all encoders to all inputs. Even if batched_dot is sufficient, its implementation is not parallel at the moment, so there's some engineering to be done there. Once those big obstacles are dealt with, there will of course be other issues too.
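As a sketch of what such a batched_dot-like Op would compute, here is a NumPy version using einsum; the names (encoders, inputs) and shapes are illustrative assumptions, not taken from the codebase.

```python
import numpy as np

# Hypothetical shapes: 26 ensembles, 50 neurons each, 3-D inputs.
n_ensembles, n_neurons, dims = 26, 50, 3
rng = np.random.RandomState(0)
encoders = rng.randn(n_ensembles, n_neurons, dims)
inputs = rng.randn(n_ensembles, dims)

# One einsum call applies every encoder matrix to its corresponding
# input vector, replacing a Python loop of n_ensembles separate dots.
batched = np.einsum('bnd,bd->bn', encoders, inputs)

# The equivalent loop, for comparison:
looped = np.array([encoders[i].dot(inputs[i]) for i in range(n_ensembles)])
```

A single fused Op like this is also the natural unit to parallelize across cores or on a GPU.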
For this project it might actually turn out that multicore CPUs are a good fit for the computations, so it may be very worthwhile to make sure that the batched_dot and elemwise code generators are at least marked up with pragmas for loop parallelization. It might even be worth doing OpenCL implementations of the key ops to ensure a good parallel algorithm is used.
Very cool! If there are areas that could use more testing support, or anything else that someone not fluent in Theano can help with, let me/us know, as this seems really essential for scaling up to really big models.
Want to sketch some more realistic test cases into test_workspace.py?
Will do!
There's some interest on theano-dev in this feature, so I'm going to close this PR. Moving the code to
https://github.com/jaberg/theano_workspace
That project can have its own issue tracker, documentation etc, which is all pretty independent of nef-py.
Test code that corresponds to nef-py use cases would still be great, preferably as a PR to theano_workspace. Down the road theano_workspace may get merged into theano, but for now it will be more wild-west hacking time than theano can handle.
Alrighty, I'll make a PR over there.
So what I'm calling "workspaces" has been on my radar for a long time for Theano, as a way of simplifying the use of shared variables and eliminating hundreds (thousands??) of lines of crap in Theano.
The idea is to not have shared variables, but instead to have a workspace:
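A minimal sketch of what that could look like, assuming the dict-like API from the toy example above; this is purely illustrative (the update pairs here are plain Python callables rather than symbolic Theano expressions), not the real Workspace implementation.

```python
import numpy as np

class Workspace(object):
    """Owns all state centrally; variables are just keys into it,
    rather than shared variables that each manage their own storage."""

    def __init__(self):
        self._vals = {}
        self._updates = {}

    def __setitem__(self, var, value):
        self._vals[var] = np.asarray(value, dtype=float)

    def __getitem__(self, var):
        return self._vals[var]

    def compile_update(self, name, pairs):
        # Register a named list of (variable, update-function) pairs.
        self._updates[name] = pairs

    def run_update(self, name):
        # Apply every registered update for this name in place.
        for var, fn in self._updates[name]:
            self._vals[var] = fn(self._vals[var])

ws = Workspace()
ws['x'] = [1, 2]
ws.compile_update('f', [('x', lambda v: 2 * v)])
ws.run_update('f')
```

Because the workspace owns all the storage, it is free to reorganize memory layout (e.g. the merging described earlier) behind the scenes.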
The reason to use this mechanism is so that
This last point follows up on recent discussion of how to conglomerate the state information across ensembles (and create ensemble views etc.) for the purpose of running models fast. It wouldn't be the same thing as making ensembles all views, but it would allow us to get fast implementations of the current type of ensemble, namely, the kind that appears to manage its own memory.