Make C calculation more memory efficient

Currently, the C calculation keeps the tensors for C_output, W_output, etc. in memory throughout the calculation of all the Cs. There's also a lot of code repetition between calculating the output tensors and calculating the tensors for every other layer.

The ideal solution would be a single function which calculates C and gets called for C_output as well as for every other node layer.

A less ideal solution, but one that would also solve the memory issue, would be a function which just calculates the output tensors and returns only what is needed (perhaps returning it on the CPU so the GPU memory can be freed).
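A rough sketch of the pattern both options rely on, assuming PyTorch. The names `calc_c_for_layer` and `calc_all_cs` are hypothetical, and the Gram matrix is only a placeholder for the real C calculation, which isn't reproduced here. The point is that intermediates stay local to one function call and only a CPU copy of the result is kept:

```python
import torch


def calc_c_for_layer(acts: torch.Tensor, device: str) -> torch.Tensor:
    """Hypothetical stand-in for the per-layer C calculation.

    The math here (a Gram matrix of the activations) is a placeholder
    assumption, not the actual computation.
    """
    acts = acts.to(device)
    # Intermediate tensors are local, so they can be freed once we return.
    gram = acts.T @ acts / acts.shape[0]
    # Hand back a CPU copy so no device tensor outlives this call.
    return gram.cpu()


def calc_all_cs(acts_by_layer: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Call the same function for the output and for every other layer."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    cs = {}
    for name, acts in acts_by_layer.items():
        cs[name] = calc_c_for_layer(acts, device)
        if device == "cuda":
            # Release unused cached memory held by the caching allocator.
            torch.cuda.empty_cache()
    return cs
```

With something like this, peak GPU memory should be bounded by the largest single layer rather than by the sum over all layers, at the cost of a device-to-host copy per layer.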

ApolloResearch / rib

Make C calculation more memory efficient #300