Let GPU sum forward to the CPU implementation for now.
It's clearly suboptimal for nllLoss and sum to transfer between
CPU/GPU, but defining the ops this way is good for getting things
working.
I would argue that it's important to encapsulate transfer logic
within ops, to prevent transfer-related bugs. Future GPU
implementations of nllLoss and sum can then be drop-in
replacements.
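A minimal sketch of the idea, assuming a hypothetical Tensor type with a device field (not the project's real API): the op itself performs the CPU round trip, so callers never see a transfer and a native GPU kernel can later replace the op body unchanged.

```python
# Hypothetical Tensor and device strings for illustration only;
# a real framework would copy device memory on transfer.
class Tensor:
    def __init__(self, data, device="cpu"):
        self.data = list(data)
        self.device = device

    def to(self, device):
        # Simulated CPU<->GPU transfer.
        return Tensor(self.data, device)

def cpu_sum(t):
    assert t.device == "cpu"
    return Tensor([sum(t.data)], "cpu")

def gpu_sum(t):
    # Transfer logic is encapsulated inside the op: move to CPU,
    # reuse the CPU kernel, move the result back. A future native
    # GPU implementation can be a drop-in replacement for this body.
    result = cpu_sum(t.to("cpu"))
    return result.to("gpu")

x = Tensor([1.0, 2.0, 3.0], device="gpu")
y = gpu_sum(x)
print(y.device, y.data)  # gpu [6.0]
```

Because the round trip is invisible at the call site, swapping in a real GPU kernel later changes no caller code.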
The next blocker for MNIST CNN model is GPU elementwise op backpropagation
logic, which isn't implemented.