Netflix / vectorflow

Apache License 2.0

Sparse net leaf/predict #12

Open yvdriess opened 7 years ago

yvdriess commented 7 years ago

I have a multi-target workload with very sparse labels. In other words, a single observable/sample is not a (bool label, SparseF[] features) tuple, but rather a (SparseF[] labels, SparseF[] features) tuple. During the learn/optimize steps, the predict run currently appears to produce a dense vector of predictions (e.g. a 65k vector), but only a few elements (e.g. 200) of that prediction vector will have a label and contribute to the gradient.

a) Is there a way in vectorflow to deal with this multi-target learning scenario? I suppose I could make a loss function that scatters the non-zero gradients into a zero vector of the predict-vector size. b) While I can see myself doing a), it is rather wasteful, as it means producing a large vector of predictions of which only a handful are used to produce the gradient.

I implemented b) in TensorFlow by gathering columns of the last-layer weight matrix, where the gathered columns match the label's non-zero coordinates (viz. the target indices). For vectorflow, it probably makes more sense to have a SparseData() leaf/output layer. It appears that a dense float[] output is currently hardcoded in the learning interface. The loss function interface could probably be kept the same, as it should not care about the actual target index, but there needs to be some logic that produces a float[] from the SparseF[] output before passing it to the loss function, and then transforms the dense gradient vector back into a sparse gradient for back-prop. A SparseLinear layer, perhaps? (dense prev-vector + SparseData input, SparseData output)
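To make the column-gather trick concrete, here is a minimal NumPy sketch (not vectorflow or TensorFlow code; all shapes and variable names are hypothetical): the forward pass multiplies the last hidden activation only with the weight columns at the labels' non-zero coordinates, and agrees with the dense prediction on exactly those coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(4)          # last hidden activation (hypothetical dim 4)
W = rng.standard_normal((4, 8))     # last-layer weights (hypothetical 4 x 8)

label_idx = np.array([2, 5])        # non-zero label coordinates (SparseF-style indices)

dense_out = h @ W                   # dense predict: computes all 8 outputs
gathered_out = h @ W[:, label_idx]  # gather: computes only the 2 labeled outputs

# the gathered outputs match the dense predictions on the labeled coordinates
assert np.allclose(dense_out[label_idx], gathered_out)
```

The gather only pays off when the label coordinates are a small fraction of the output dimension, since indexing the columns costs a copy.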

rostyboost commented 6 years ago

Thanks for your question @yvdriess. If I understand correctly, you still want a dense forward-prop for your loss, but the gradient that needs to be backpropagated is sparse, because it only involves a subset of the output neurons (presumably on the coordinates matching what's in your SparseF[] labels?). This is definitely possible in vectorflow. There is an overload of the callback mechanism, resolved at compile time, where you can populate a sparse gradient for your loss. It's not so well documented, but you can find a dummy example here. So you can just append the non-zero SparseF coordinates of the gradient into this buffer. This would be the most efficient approach for your problem, as it avoids any unnecessary "dense to sparse" conversion. Typically vectorflow works by reference and tries to avoid any memcopy, so there is no "gather": it is assumed that the memcopy cost outweighs any potential SIMD gain from "densifying" a sparse operation, which tends to be the case for most IO-bound / shallow problems.
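As a language-agnostic illustration of the sparse-gradient idea (plain NumPy, not the vectorflow callback API; all names and the squared loss are assumptions for the sketch): the loss gradient is kept as (index, value) pairs on the labeled coordinates only, and scattering those pairs into a zero vector recovers the equivalent dense gradient.

```python
import numpy as np

out_dim = 10
pred = np.linspace(0.0, 1.0, out_dim)   # dense forward-pass predictions

# sparse labels as (index, value) pairs, analogous to SparseF
label_idx = [1, 4, 7]
label_val = [0.5, 0.0, 1.0]

# squared-loss gradient: non-zero only on the labeled coordinates
sparse_grad = [(i, pred[i] - v) for i, v in zip(label_idx, label_val)]

# scattering into a zero vector recovers the equivalent dense gradient
dense_grad = np.zeros(out_dim)
for i, g in sparse_grad:
    dense_grad[i] = g

assert np.count_nonzero(dense_grad) == len(label_idx)
```

Keeping the gradient in its sparse form and handing only the non-zero pairs to backprop is what avoids the dense-to-sparse round trip described above.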

yvdriess commented 6 years ago

Thanks for the response. I bit the bullet, dove in, and indeed found the SparseF[] grad overload for the loss functions. That takes care of a), but I am still not sure how to do b) without implementing a SparseLinearLayer or similar myself.

The basic idea is that optimizing for the above case, by making a sparse linear layer that takes an extra coordinates argument and produces SparseF[] output, would decrease the amount of memory touched. (Compute is probably not the bottleneck here.) In our workload, the labels vector is very sparse (~1/1500), so this optimization should have a big impact.
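A rough NumPy sketch of the savings (hypothetical shapes and names, not vectorflow code): with a single labeled output coordinate, the dense backward pass materializes the full hidden × out_dim weight-gradient matrix, while the sparse variant only computes the columns at the label coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, out_dim = 4, 1500            # ~1/1500 label sparsity, as in the thread
h = rng.standard_normal(hidden)      # last hidden activation

label_idx = np.array([42])           # single labeled output coordinate
g = np.array([0.3])                  # loss gradient on that coordinate

# dense backprop: scatter the gradient, then touch every column of W
dense_out_grad = np.zeros(out_dim)
dense_out_grad[label_idx] = g
dense_grad_W = np.outer(h, dense_out_grad)   # hidden x out_dim, mostly zeros

# sparse backprop: compute only the columns at the label coordinates
sparse_grad_cols = np.outer(h, g)            # hidden x 1

assert np.allclose(dense_grad_W[:, label_idx], sparse_grad_cols)
```

At ~1/1500 sparsity the sparse variant writes roughly 1/1500th of the memory of the dense one, which is where the expected gain comes from.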

I guess the real question is: Is float[] network output baked in everywhere? Or should it be relatively straightforward for me to make a SparseLinearLayer that produces a SparseF[] network output?

yvdriess commented 6 years ago

If I may resurrect this issue: I just noticed while adding a SparseLinear layer to Vectorflow that only the first layer can be sparse.

https://github.com/Netflix/vectorflow/blob/aed9977ed07d91c0f23e4bf866cc711dd21c4a2f/src/vectorflow/layers.d#L119

Is this work in progress, or should I provide an implementation?