LLNL / lbann

Livermore Big Artificial Neural Network Toolkit
http://software.llnl.gov/lbann/

Fused entry-wise layers #194

Open timmoon10 opened 6 years ago

timmoon10 commented 6 years ago

Our forward/backward prop implementation requires storing every layer's activations and error signals. Fusing entry-wise operations together would let us avoid storing these intermediate values, freeing up memory capacity. Since these operations are often memory-bound, fusion would also boost performance: we would read and write the data once instead of at each forward/backward prop step. Steps to implement this functionality:

This functionality will become especially important if #193 is implemented since custom objective functions will often require a sequence of entry-wise operations prior to a reduction.
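To illustrate the memory-traffic argument, here is a minimal sketch (not LBANN code; the kernel names are hypothetical) of three separate entry-wise kernels versus a single fused kernel computing `y = relu(a*x + b)`. The unfused version reads and writes global memory at each step and materializes intermediate buffers, while the fused version touches each element once and keeps intermediates in registers:

```cuda
#include <cuda_runtime.h>

// Unfused: three kernel launches, each reading and writing n floats
// to global memory and producing an intermediate buffer.
__global__ void scale_kernel(const float* x, float* y, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { y[i] = a * x[i]; }
}
__global__ void add_kernel(const float* x, float* y, float b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { y[i] = x[i] + b; }
}
__global__ void relu_kernel(const float* x, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { y[i] = fmaxf(x[i], 0.f); }
}

// Fused: one launch, one global-memory read and one write per element,
// and no intermediate buffers that would otherwise need to be kept
// around for backprop.
__global__ void fused_scale_add_relu_kernel(const float* x, float* y,
                                            float a, float b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { y[i] = fmaxf(a * x[i] + b, 0.f); }
}
```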

timmoon10 commented 5 years ago

I'm not sure if this is currently possible, since CUDA kernels don't support runtime polymorphism. Attempts to mimic polymorphism with device function pointers haven't had any success.
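One possible workaround is compile-time (template) polymorphism instead of runtime dispatch: entry-wise operators can be written as functors with a `__device__` call operator and composed into a single kernel instantiation, with no device function pointers involved. A minimal sketch, assuming this approach (the operator and kernel names are hypothetical, not LBANN's API):

```cuda
#include <cuda_runtime.h>

// Hypothetical entry-wise operators expressed as functors.
struct square_op {
  __device__ float operator()(float x) const { return x * x; }
};
struct scale_op {
  float alpha;
  __device__ float operator()(float x) const { return alpha * x; }
};

// Compose two operators at compile time: g(f(x)).
template <typename F, typename G>
struct compose_op {
  F f; G g;
  __device__ float operator()(float x) const { return g(f(x)); }
};

// One generic kernel, instantiated per composed operator; no device
// function pointers or virtual dispatch involved.
template <typename Op>
__global__ void apply_entrywise(Op op, const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = op(in[i]); }
}

// Example launch: out[i] = 2 * (in[i] * in[i])
void launch_example(const float* d_in, float* d_out, int n) {
  compose_op<square_op, scale_op> op{square_op{}, scale_op{2.f}};
  int block = 256;
  int grid = (n + block - 1) / block;
  apply_entrywise<<<grid, block>>>(op, d_in, d_out, n);
}
```

The trade-off is that every distinct composition produces its own kernel instantiation, so fusion happens at compile time rather than being chosen at runtime, at some cost in compile time and binary size.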

ndryden commented 5 years ago

I think it may be worth looking at what other frameworks do, since fusing operations is a common optimization. TensorFlow has a combination of manually fused operations, and its XLA compiler can fuse both ahead of time and via JIT compilation. PyTorch has Tensor Comprehensions. Caffe2 (which is merging into PyTorch) also does kernel fusion for deployment. MXNet also does fusion. Chainer doesn't do it yet, but appears to be moving in that direction.