Can naively be implemented as a matrix multiply with a Toeplitz matrix, but that is far too wasteful, so it really should be implemented as a custom op (see https://github.com/tensorflow/custom-op).
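For reference, the naive Toeplitz formulation can be sketched in a few lines. This is a minimal NumPy illustration, not the proposed implementation; shapes and names are made up for the example. The waste is visible directly: the matrix is mostly zeros.

```python
import numpy as np

# Toy 1-D "valid" convolution as a Toeplitz-style matrix multiply.
k = np.array([1.0, 2.0, 3.0])      # kernel
x = np.arange(6, dtype=float)      # input signal
L, K = len(x), len(k)

# Each row of T is a shifted copy of the reversed kernel; most entries are zero,
# which is exactly why this approach wastes memory and FLOPs.
T = np.zeros((L - K + 1, L))
for r in range(L - K + 1):
    T[r, r:r + K] = k[::-1]        # reversed so T @ x matches true convolution

y = T @ x                          # same result as np.convolve(x, k, "valid")
```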
Should support standard convolution and graph convolution (useful e.g. for operating on vertices / meshes).
// Edit
Tried implementing this in two different ways using only built-in ops:
Using N^2 normal convolutions (N = #blades): build x_ij... = conv(a_...i, k_j), then contract with the Cayley tensor to get the result: x_ij..., c_ijk -> y_...k. This uses far too much memory, since we store an intermediate the size of the output times N^2.
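The first approach can be sketched as follows. This is a hypothetical NumPy stand-in for the TF built-ins (blade count, shapes, and the random stand-in for the Cayley tensor are placeholders); it shows where the output-times-N^2 intermediate appears.

```python
import numpy as np

N, L, K = 4, 16, 3                   # blades, signal length, kernel length
rng = np.random.default_rng(0)
a = rng.standard_normal((L, N))      # input with N blade channels
k = rng.standard_normal((K, N))      # kernel with N blade channels
c = rng.standard_normal((N, N, N))   # placeholder for the algebra's Cayley tensor

# All N^2 pairwise convolutions: x[i, j] = conv(a[..., i], k[..., j]).
# This stack is the problem: it is output-sized times N^2.
x = np.stack([
    np.stack([np.convolve(a[:, i], k[:, j], mode="valid") for j in range(N)])
    for i in range(N)
])                                   # shape (N, N, L - K + 1)

# Contract with the Cayley tensor: x_ij..., c_ijk -> y_...k
y = np.einsum("ijl,ijk->lk", x, c)   # shape (L - K + 1, N)
```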
Same as the first, but instead of doing the N^2 normal convolutions in parallel, do them sequentially and accumulate the results. The memory usage of this is fine, but it is far too slow (since the convolutions run sequentially).
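The sequential variant looks like this (again a hypothetical NumPy sketch with placeholder shapes and a random stand-in for the Cayley tensor): each pairwise convolution is folded into the output immediately, so the N^2-sized intermediate is never materialized, at the cost of a serial N^2 loop.

```python
import numpy as np

N, L, K = 4, 16, 3                   # blades, signal length, kernel length
rng = np.random.default_rng(0)
a = rng.standard_normal((L, N))      # input with N blade channels
k = rng.standard_normal((K, N))      # kernel with N blade channels
c = rng.standard_normal((N, N, N))   # placeholder Cayley tensor

# Accumulate each conv directly into the output instead of stacking them.
y = np.zeros((L - K + 1, N))
for i in range(N):
    for j in range(N):
        conv_ij = np.convolve(a[:, i], k[:, j], mode="valid")  # (L - K + 1,)
        y += conv_ij[:, None] * c[i, j]   # broadcast over output blade index k
```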
The only good solution seems to be writing a custom op with a CUDA kernel.