PetaVision / OpenPV

PetaVision is a C++ library for designing and deploying large-scale neurally-inspired computational models.
http://petavision.github.io
Eclipse Public License 1.0

use eligibility list to speed up LCA #38

Open garkenyon opened 8 years ago

garkenyon commented 8 years ago

I think we can produce a significant speedup in our execution of LCA. Since most of the neurons in any LCA computation are below threshold for most of the display period, perhaps we could just ignore them (with occasional reality checks) and work entirely within the much smaller subset of neurons that are either active or only slightly below threshold. I believe it should be possible to exploit this idea to obtain a big win with relatively minor changes to the code.

It seems that the basic strategy would thus be:

  1. Make a list of "eligible" neurons whose membrane potentials are non-negative (V >= 0). One could also pick a different threshold value for eligibility, but 0 seems like a natural choice.
  2. For each time step i within a given display period, evaluate whether i < T1 || mod(i, T2) == 0.

If the above condition is false, receive input to and update only the neurons on the eligible list. Since the receive step is typically performed on the GPU from the post perspective, this should be a relatively simple change: the neuron index would be drawn from the eligible list rather than from the list of all neurons. Note that updates are typically performed on the GPU as well and would also benefit from a much smaller list of targets to update.

If the above condition is true, receive input to and update ALL neurons. The idea is that once the network has settled into a good enough approximation to the optimal sparse code, we only have to check occasionally whether a neuron has been promoted into the eligible category or demoted out of the eligible population.

In the above, T1 would be set to the smallest value at which eligible neurons can be identified with reasonable accuracy, and T2 would be set to the largest value we can get away with such that the final answer remains independent of T2 (to within some tolerance).
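To make the control flow concrete, here is a minimal C++ sketch, assuming a flat array of membrane potentials; advanceDisplayPeriod, receiveAndUpdate, and the parameter names are hypothetical illustrations, not existing OpenPV code:

```cpp
#include <cstddef>
#include <vector>

// Placeholder for the post-perspective receive plus LCA membrane update of a
// single neuron; in OpenPV this work lives in the layer/connection classes.
static void receiveAndUpdate(std::size_t k, std::vector<float> &V) {
   (void)k;
   (void)V; // stub
}

// Eligibility-gated update loop for one display period.
static void advanceDisplayPeriod(int numTimesteps, int T1, int T2,
                                 std::vector<float> &V,
                                 float eligibilityThreshold = 0.0f) {
   std::vector<std::size_t> eligible;
   for (int i = 0; i < numTimesteps; ++i) {
      bool fullPass = (i < T1) || (i % T2 == 0);
      if (fullPass) {
         // Reality check: receive input to and update every neuron, then
         // rebuild the eligible list from the fresh membrane potentials.
         eligible.clear();
         for (std::size_t k = 0; k < V.size(); ++k) {
            receiveAndUpdate(k, V);
            if (V[k] >= eligibilityThreshold) {
               eligible.push_back(k);
            }
         }
      }
      else {
         // Fast path: touch only the (much smaller) eligible subset.
         for (std::size_t k : eligible) {
            receiveAndUpdate(k, V);
         }
      }
   }
}
```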

Since we are somewhat CPU bound right now, the above scheme would probably have the biggest impact on Trinity. However, by taking advantage of sparsity even when receiving from the post perspective, we might also be able to obtain competitive performance on CPU only machines, which are much cheaper in the AWS cloud right now compared to GPU machines.

The above scheme should also speed up our computations on the GPU considerably, which can only help our overall performance.

It's also possible that our timing information is too crude to resolve what's actually happening during a display period. It might well be that we're only CPU bound during the initial portion of the display period, when the activity is not very sparse due to the recent image flip. Later in the display period, once the activity is more sparse, we might be more GPU bound, at least relative to the initial portion of the display period. Thus, we might get a bigger win even on a GPU machine than our current timing results indicate. Right now, we're about equally divided between GPU- and CPU-based convolutions, so speeding up the GPU convolutions should still give us a decent speedup.

Of course, the above behavior would have to be under the control of a "useEligibilityListFlag" since we're not always doing LCA.

slundqui commented 8 years ago

Starting to implement this via commit 7d25d960552dbba3992cc6c50a35e6353d65bc2b. Currently in a pending pull request.

dpaiton commented 8 years ago

I'm trying to figure out how much this will influence your model, but to do that I need some stats. What would an example eligibility threshold be? And an example activation threshold? What would your display period be? Does this also have an adaptive time step?

slundqui commented 8 years ago

Talking with Dylan raised a few questions about this topic for me.

  1. Say we pick a timestep within a display period at which to start using this method. Is there ever a chance for a neuron that is not on the eligible list to become eligible? I believe it can never become eligible, since our LCA neuron is a leaky integrator (see the dynamics sketch after this list). All eligible neurons will have a huge advantage over non-eligible neurons, since they get to integrate T2 times more than their non-eligible counterparts. Therefore, I believe the mod(i, T2) == 0 portion is irrelevant.
  2. This method assumes that the eligibility list will not change after time T1. If that assumption holds, then why are we spending effort making sure the energy reaches its lowest point within a display period (via the adaptive timestep)? It seems that if we are making this assumption, we're implying that an approximate sparse encoding will suffice for our purposes, in which case there are many existing algorithms that achieve sparse coding much more efficiently than LCA (for example, a feedforward net that predicts the sparse approximation).
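For reference, these are the standard LCA dynamics (Rozell et al., 2008) behind the leaky-integrator argument, with b = Phi^T x the driving input, G = Phi^T Phi the Gram matrix of the dictionary, and T_lambda the thresholding function:

```latex
\tau \dot{u}_k(t) = b_k(t) - u_k(t) - \sum_{j \neq k} G_{kj}\, a_j(t),
\qquad a_k(t) = T_\lambda\!\big(u_k(t)\big)
```

The -u_k(t) term is the leak referred to above: eligible neurons keep integrating their input every step, while a neuron skipped for T2 steps does not.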

Food for thought.

garkenyon commented 8 years ago

The list of eligible neurons can definitely change during a display period. If a previously active neuron becomes inactive, this can open up space for a previously inactive neuron to become active. It's for this reason that greedy methods are inaccurate in general: the winners now may be the losers later. It could be that the eligibility list should include any neuron whose membrane potential is increasing, regardless of its current value, to help make sure we don't deprive some up-and-coming neuron of its big opportunity.

Has anyone formally checked how well sparse estimation algorithms work? It would be an interesting project to test a full-blown sparse solver against a single-layer estimator with respect to classification, say on CIFAR or PASCAL. Currently, I'm training a sparse estimator to predict the sparse code of the next block of frames from the sparse code of the block of frames shifted back by one frame. I'm planning on using this estimate as an initial condition, however, not as my final answer. Could we do something similar with existing sparse estimation techniques? Estimate the next sparse representation, but then perform a reduced number of gradient descent steps to fine-tune the estimate?
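A hedged sketch of the widened eligibility test suggested above; dVdt stands for a stored finite difference of the membrane potential and is a hypothetical quantity, not something OpenPV currently tracks:

```cpp
// Hypothetical eligibility predicate: keep a neuron if its membrane potential
// is at/above threshold OR still rising, so "up and coming" neurons stay in play.
inline bool isEligible(float V, float dVdt, float threshold = 0.0f) {
   return V >= threshold || dVdt > 0.0f;
}
```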

dpaiton commented 8 years ago

How much would it cost to maintain the list of eligible neurons? Do you lose your computational advantage if the process is too complicated?

As for alternative sparse coding methods: LCA will find an approximation that is nearly (if not exactly) the same as ISTA's. Since the PV models typically use LCA+Momentum, you should expect it to find a solution that matches the FISTA solution. This assumes no adaptive time step, of course. Conjugate gradient descent is another method that will find a solution quite similar to the ISTA solution, but much faster than ISTA does. Sheng told me that @maxtheiler has implemented in PetaVision the convolutional FISTA model described in R. Chalasani, J. C. Principe, and N. Ramakrishnan (2013), "A Fast Proximal Method for Convolutional Sparse Coding." I believe it would be valuable to compare this directly to the convolutional LCA+Momentum model (w/out adaptive time step) for both speed and similarity of the sparse solution.
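For context, the standard ISTA and FISTA iterations being compared here (Beck & Teboulle, 2009), with Phi the dictionary, x the input, eta the step size, and T the soft-threshold operator; FISTA starts from t_1 = 1 and y^{(1)} = a^{(0)}:

```latex
% ISTA: proximal gradient step
a^{(k+1)} = T_{\lambda\eta}\!\left( a^{(k)} - \eta\, \Phi^\top\!\big(\Phi a^{(k)} - x\big) \right)

% FISTA: the same step taken at a momentum-extrapolated point y
a^{(k+1)} = T_{\lambda\eta}\!\left( y^{(k)} - \eta\, \Phi^\top\!\big(\Phi y^{(k)} - x\big) \right),
\quad t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2},
\quad y^{(k+1)} = a^{(k+1)} + \frac{t_k - 1}{t_{k+1}}\big(a^{(k+1)} - a^{(k)}\big)
```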

LISTA uses a feedforward net to approximate the sparse coding solution. This is probably the fastest way to produce a sparse code, but in my opinion it completely defeats the point of doing sparse coding. If you want a model that can incorporate top-down feedback, produce a predictive code, and demonstrate non-linear (i.e., non-classical) receptive field properties, then LISTA is no good. Another problem with LISTA is that it attempts to find the initial MAP estimate from a sparse coding model, and this initial fixed point is not necessarily the best fixed point. From @wshainin's recent work, it seems that a model which dynamically seeks out an alternate fixed point using additional constraints (beyond the typical sparse coding constraints) is going to be more effective as a precursor to a supervised network. This wouldn't work with any LISTA-type method.

garkenyon commented 8 years ago

Maintaining an eligibility list should require trivial resources, especially compared to the potential for a factor-of-ten savings overall. On a GPU, none of this matters; we just let NVIDIA perform the convolutions for every neuron regardless of how far below threshold any particular neuron might be. But on a non-GPU machine, like Trinity or the 256-node non-GPU machine we just got permission to run on, we know that 90% of the cycles are spent in the non-sparse convolution over the error layer needed to compute the input to the LCA layer. The key point is that 90% of these convolutions are probably unnecessary, since those neurons aren't ever getting over threshold. However, we have to be careful, because it's not entirely obvious who the winning neurons are going to be.

I appreciate the comments about LISTA. I suspect a true sparse solver will perform much better than LISTA, and the point about non-classical effects being lost is very pertinent. However, it would be interesting to quantify this. Max Theiler implemented straight-up ISTA as far as I recall. Our LCA implementation is rectifying, which creates differences from ISTA, but in principle rectification is not required. However, biological neurons are rectified, as are the spiking neuromorphic LCA chips. I believe in sticking with the biology when there is no compelling reason not to.

I also agree that it's interesting to determine whether LCA with momentum is similar to FISTA. All gradient descent algorithms employ an adaptive time step as far as I know. BFGS is definitely a promising direction, but a bit complicated to code; we have the gradients, so in principle it's very doable. For convex models, the MAP estimate is the only estimate that makes sense to me. For non-convex models, God help you. Video might help to build up a better solution over frames. In general, I don't believe in sparse models for visual processing on anything but video: a human couldn't learn to see if only exposed to static images, and neither can LCA. I hope we will soon be able to test ideas about non-convex models in the context of the sparse coding of video.

dpaiton commented 8 years ago

Regarding the issue at hand, I think if you're careful about which neurons you turn off you should see a significant speedup on CPU as you describe. Choosing which neurons to turn off and when to turn them off will be tricky, but not impossible. Even with a conservative estimate you should see a CPU speedup.