LLNL / lbann

Livermore Big Artificial Neural Network Toolkit
http://software.llnl.gov/lbann/

Fused entry-wise layers #194

Open timmoon10 opened 6 years ago

timmoon10 commented 6 years ago

Our forward/backward prop implementation requires storing every layer's activations and error signals. Fusing entry-wise operations together would let us avoid storing these intermediate values, freeing up memory capacity. Since these operations are often memory-bound, fusion would also boost performance: we would read and write the data once instead of at each forward/backward prop step. Steps to implement this functionality:

This functionality will become especially important if #193 is implemented since custom objective functions will often require a sequence of entry-wise operations prior to a reduction.
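To illustrate the memory-traffic argument, here is a minimal sketch (not LBANN code; the kernel names are hypothetical) of three separate entry-wise kernels versus a single fused kernel computing `y = relu(a*x + b)`. The unfused version reads and writes global memory at each step and materializes intermediate buffers, while the fused version touches each element once and keeps intermediates in registers:

```cuda
#include <cuda_runtime.h>

// Unfused: three kernel launches, each reading and writing n floats
// to global memory and producing an intermediate buffer.
__global__ void scale_kernel(const float* x, float* y, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { y[i] = a * x[i]; }
}
__global__ void add_kernel(const float* x, float* y, float b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { y[i] = x[i] + b; }
}
__global__ void relu_kernel(const float* x, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { y[i] = fmaxf(x[i], 0.f); }
}

// Fused: one launch, one global-memory read and one write per element,
// and no intermediate buffers that would otherwise need to be kept
// around for backprop.
__global__ void fused_scale_add_relu_kernel(const float* x, float* y,
                                            float a, float b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { y[i] = fmaxf(a * x[i] + b, 0.f); }
}
```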

timmoon10 commented 5 years ago

I'm not sure if this is currently possible, since CUDA kernels don't support runtime polymorphism. Attempts to mimic polymorphism with device function pointers haven't had any success.
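One possible workaround is compile-time (template) polymorphism instead of runtime dispatch: entry-wise operators can be written as functors with a `__device__` call operator and composed into a single kernel instantiation, with no device function pointers involved. A minimal sketch, assuming this approach (the operator and kernel names are hypothetical, not LBANN's API):

```cuda
#include <cuda_runtime.h>

// Hypothetical entry-wise operators expressed as functors.
struct square_op {
  __device__ float operator()(float x) const { return x * x; }
};
struct scale_op {
  float alpha;
  __device__ float operator()(float x) const { return alpha * x; }
};

// Compose two operators at compile time: g(f(x)).
template <typename F, typename G>
struct compose_op {
  F f; G g;
  __device__ float operator()(float x) const { return g(f(x)); }
};

// One generic kernel, instantiated per composed operator; no device
// function pointers or virtual dispatch involved.
template <typename Op>
__global__ void apply_entrywise(Op op, const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = op(in[i]); }
}

// Example launch: out[i] = 2 * (in[i] * in[i])
void launch_example(const float* d_in, float* d_out, int n) {
  compose_op<square_op, scale_op> op{square_op{}, scale_op{2.f}};
  int block = 256;
  int grid = (n + block - 1) / block;
  apply_entrywise<<<grid, block>>>(op, d_in, d_out, n);
}
```

The trade-off is that every distinct composition produces its own kernel instantiation, so fusion happens at compile time rather than being chosen at runtime, at some cost in compile time and binary size.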

ndryden commented 5 years ago

I think it may be worth looking at what other frameworks do, since fusing operations is a common optimization. TensorFlow has a combination of manually fused operations, and its XLA compiler can fuse both ahead of time and via JIT compilation. PyTorch has Tensor Comprehensions. Caffe2 (which is merging into PyTorch) also does kernel fusion for deployment. MXNet also does fusion. Chainer doesn't do it yet, but appears to be moving in that direction.