Usage equivariant MLP

vec123 opened 10 months ago

vec123 commented 10 months ago


I am attempting to build an equivariant variational encoder-decoder framework.

For this I am using R2Conv() and R3Conv() layers in the encoder, with trivial-representation input and output and regular representations in between. For the decoder I would like to use equivariant MLPs. However, it is quite unclear to me how the examples map to a generic MLP.

For example, I do not understand how one could specify the input and output dimensions. Instead it seems to me that the equivariant MLP expects (just like a CNN) a 2D or 3D input, and that the output dimension is determined by the harmonic decomposition of functions on that space. In contrast, an MLP accepts a flat input, and the (flat) output dimension is a hyperparameter specified by the user.

During my learning process, I start with a rectangular input grid of shape [B,1,X,Y,Z] corresponding to a scalar field (trivial representation). I use R3Conv() to get [B,1,X,Y,1] with one hidden regular representation and a trivial-representation output, store [B,1,Z_encoding_size] as the encoding of Z, and continue with [B,X,Y,1] and R2Conv() to obtain the encodings of X and Y in shape [B, 1, X_encoding_size, Y_encoding_size]. A final linear layer maps the [B,1, X_encoding_size , Y_encoding_size , Z_encoding_size] shaped encoding to a latent space that parametrizes the mean and variance of a distribution.

This seems more or less clear to me. The decoder part is much less clear.

I really hope for some clarification. Equivariant learning is something I only discovered a week ago, and it feels like opening Pandora's box, considering all the nice but extensive theory behind it.
Sadly I do not have the time to pick it all up, nor is there anyone in my environment who knows this material. Is it reasonable to expect to have a working model within a week?

maxxxzdn commented 7 months ago


Did you see the MLP example? https://github.com/QUVA-Lab/escnn/blob/master/examples/mlp.ipynb

An equivariant MLP doesn't expect a base space (neither 2D nor 3D); it works exactly like a classic MLP and takes only a stack of feature fields:

from escnn import group, gspaces

G = group.so3_group()

# since we are building an MLP, there is no base-space
gspace = gspaces.no_base_space(G)

# assume you have scalar and vector quantities in your output:
scalar_repr = gspace.trivial_repr
vector_repr = gspace.fibergroup.standard_representation()

# assume your output goes like [[scalar, vector], [scalar, vector], ...., [scalar, vector]]
channel_repr = group.directsum([scalar_repr, vector_repr])

# specify the number of channels in input and output
c_in = 1
c_out = 12
in_repr = c_in * [channel_repr]
out_repr = c_out * [channel_repr]

# define feature field type
in_type = gspace.type(*in_repr)
out_type = gspace.type(*out_repr)

# define your MLP
# (MLP is a user-defined equivariant module, e.g. a stack of escnn.nn.Linear
# layers and non-linearities; it is not defined in this snippet)
mlp = MLP(in_type, out_type)

As a result, you give your MLP a "flat" stack of features (here, 1 copy of [scalar, vector]) and get back another stack of features, now with 12 copies.
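To connect this to the "flat input / flat output dimension" question, here is a minimal sketch of how the flat sizes follow from the field types in the snippet above. It is plain Python rather than escnn code; `flat_size`, `SCALAR_DIM` and `VECTOR_DIM` are hypothetical names introduced only for illustration. As far as I understand the escnn API, `in_type.size` and `out_type.size` should report the same numbers.

```python
# Dimensions of the two SO(3) representations used in the snippet above.
SCALAR_DIM = 1  # trivial representation
VECTOR_DIM = 3  # standard representation


def flat_size(num_channels, irrep_dims=(SCALAR_DIM, VECTOR_DIM)):
    """Length of the flat feature vector for `num_channels` copies of [scalar, vector]."""
    return num_channels * sum(irrep_dims)


in_size = flat_size(1)    # c_in = 1
out_size = flat_size(12)  # c_out = 12
print(in_size, out_size)
```

So the user-chosen hyperparameter is the number of copies of each representation, and the flat dimension follows from it: here the MLP would map a tensor of shape [B, 4] to one of shape [B, 48].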

Danfoa commented 6 months ago

Hi @maxxxzdn and @Gabri95,

Following your suggestion, say I build an equivariant MLP from input, hidden, and output equivariant linear layers with some activation function, such that the hidden-layer group representation is defined by hidden_repr = c_in * [channel_repr]

By Schur's lemma, we know there is no nonzero equivariant linear map between non-isomorphic irreducible representations; therefore, this naive construction of the equivariant MLP will result in a network that never mixes the signals from scalar and vector representations. That is, the result is a decoupled network that maps only scalar fields to scalar fields and vector fields to vector fields. This is clearly a bad architectural design.
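The constraint on the linear layers can be checked numerically. The sketch below is an illustration under stated assumptions, not escnn code: it uses the 24 rotations of the octahedral group as a finite stand-in for SO(3), and projects an arbitrary weight matrix onto the equivariant maps by group averaging. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def octahedral_rotations():
    """The 24 signed permutation matrices with determinant +1."""
    rotations = []
    for perm in permutations(range(3)):
        for signs in product((1, -1), repeat=3):
            m = [[signs[i] if perm[i] == j else 0 for j in range(3)]
                 for i in range(3)]
            if det3(m) == 1:
                rotations.append(m)
    return rotations


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def project_equivariant(w, rho_out, rho_in, elements):
    """Average rho_out(g) @ w @ rho_in(g)^-1 over the group elements."""
    rows, cols = len(w), len(w[0])
    total = [[Fraction(0)] * cols for _ in range(rows)]
    for g in elements:
        # Both representations are orthogonal, so the inverse is the transpose.
        term = matmul(matmul(rho_out(g), w), transpose(rho_in(g)))
        total = [[total[i][j] + term[i][j] for j in range(cols)]
                 for i in range(rows)]
    return [[x / len(elements) for x in row] for row in total]


def trivial(g):
    return [[1]]  # scalars do not transform


def vector(g):
    return g  # vectors transform by the rotation matrix itself


rotations = octahedral_rotations()

# Arbitrary weights; only their equivariant part survives the averaging.
scalar_to_vector = project_equivariant([[2], [-5], [7]], vector, trivial, rotations)
vector_to_vector = project_equivariant(
    [[1, 2, 3], [4, 2, 6], [7, 8, 3]], vector, vector, rotations)

print(scalar_to_vector)  # every entry vanishes
print(vector_to_vector)  # a multiple of the identity
```

The scalar-to-vector block is forced to zero and the vector-to-vector block collapses to a single free parameter, which is the decoupling described above for the linear layers alone.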

To mix fields of different types, we are required to perform the CG tensor product; however, it is not clear how to use it, and especially what good design principles are for embedding the CG tensor product in the architecture.

Any insights?

maxxxzdn commented 5 months ago

Please note that the interaction between irreps will happen in the non-linearity (e.g. QuotientELU), so using nn.TensorProductModule is not the only way.

Danfoa commented 5 months ago

Hi @maxxxzdn, can you point to a paper, lecture notes, or an escnn documentation page where the action of these quotient activations is clearly explained? I am afraid I am unable to understand from the documentation how signals from different irreps are mixed in this fashion.

Gabri95 commented 4 weeks ago

Maybe it's simpler to explain it via an example. Say we build a G=SO(3) equivariant MLP using Fourier-based pointwise non-linearities.

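If we use the FourierELU non-linearity, it assumes we employ a band-limited regular representation of SO(3). The resulting architecture then resembles the internal layers of the original Spherical CNNs paper by Cohen et al. (which is equivalent to a GCNN over SO(3)), which alternates a Fourier transform (FFT), a convolution computed in the frequency domain via the convolution theorem, an inverse Fourier transform, and a pointwise non-linearity in the spatial domain.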

Our SO(3) MLP would be essentially identical, with the difference that we don't actually implement the FFT but use the dense FT matrix, and that, for convenience, we represent the sequence as a linear layer acting on the Fourier coefficients followed by a non-linear layer in which the IFT and FT are merged (IFT, pointwise activation, FT).

Note that the convolution theorem used above is just another way of thinking about Schur's Lemma!

In the case of quotient representations, we restrict our attention to signals over a quotient space X=G/H, which are nothing more than signals over G that are constant over the H cosets. This is the case for spherical signals, which can be thought of as SO(3) signals constant over the SO(2) fibers. The formulation above remains identical but, because the signal lives in a smaller space, we require fewer Fourier coefficients and, since the signal is constant along the H fibers, we only require sampling in the domain X rather than G.

For example, for spherical signals, we only need the spherical harmonics (which correspond to one column of each of the Wigner D matrices, in the right basis). One can prove that this architecture is equivalent to a Spherical CNN using only zonal spherical filters (i.e. filters which are invariant wrt rotations around a certain axis): this is of course less expressive than a full GCNN over SO(3), but is also less expensive.
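To make the saving concrete, here is a small count of Fourier coefficients per channel, assuming a band limit at degree L. It relies only on the fact that the degree-l Wigner D matrix is (2l+1) x (2l+1) and that the sphere keeps one column of it; the function names are hypothetical.

```python
def so3_coefficients(max_degree):
    """Coefficients of a band-limited signal on SO(3): a full (2l+1) x (2l+1) block per degree l."""
    return sum((2 * l + 1) ** 2 for l in range(max_degree + 1))


def sphere_coefficients(max_degree):
    """Coefficients of a band-limited signal on the sphere: one column (2l+1 entries) per degree l."""
    return sum(2 * l + 1 for l in range(max_degree + 1))


for max_degree in range(4):
    print(max_degree, so3_coefficients(max_degree), sphere_coefficients(max_degree))
```

The gap grows quickly with the band limit, which is the sense in which the quotient version is cheaper.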

Mixing of frequencies happens in the non-linearity, which acts pointwise in the "spatial" domain, while convolution acts pointwise in the "frequency" domain. This is the exact same design idea behind any convolutional network (covering both CNNs and GCNNs).
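A toy version of this mixing, as a sketch rather than the escnn implementation: it assumes the circle (SO(2) sampled at N points) in place of SO(3), so that irreps are just integer frequencies and the Fourier transform is an ordinary DFT. The helper names are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Normalized discrete Fourier coefficients c_k for k = 0..N-1."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)) / n
            for k in range(n)]


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


N = 16
angles = [2 * math.pi * t / N for t in range(N)]

# Input feature: a single frequency-1 component, sampled on the circle.
signal = [math.cos(a) for a in angles]
before = [abs(c) for c in dft(signal)]

# Apply ELU independently at every sample, then go back to frequencies.
after = [abs(c) for c in dft([elu(x) for x in signal])]

for k in range(4):
    print(k, round(before[k], 4), round(after[k], 4))
```

Before the activation only frequency 1 carries energy; afterwards frequencies 0 and 2 do as well, even though no linear layer connected them.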

Whether this strategy is better than the tensor-product / Clebsch-Gordan transform is more of a practical question. Here's a couple of useful insights, though. The Clebsch-Gordan transform is nothing more than a quadratic non-linearity, which means it can at most double the number of frequencies in a signal. This is convenient because the output signal is still band-limited and the operation can be implemented in an exact way, preserving exact equivariance. Conversely, general pointwise activations can introduce arbitrarily high frequencies and have more freedom to mix them, but this comes with the drawback that they typically require some approximation and don't guarantee exact equivariance (although this error can be controlled).
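The band-limit argument can be illustrated on the circle as well. This is a sketch under the assumption that a pointwise product of sampled signals stands in for the tensor product (on SO(2), multiplying frequencies m and n yields m+n and m-n); it is not a Clebsch-Gordan implementation for SO(3), and the helper names are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Normalized discrete Fourier coefficients c_k for k = 0..N-1."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)) / n
            for k in range(n)]


def bandwidth(samples, tol=1e-9):
    """Largest frequency k <= N/2 whose coefficient is above the tolerance."""
    coeffs = dft(samples)
    return max(k for k in range(len(samples) // 2 + 1) if abs(coeffs[k]) > tol)


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


N = 32
angles = [2 * math.pi * t / N for t in range(N)]

# A signal band-limited to frequency 2.
signal = [math.cos(a) + math.cos(2 * a) for a in angles]

squared = [x * x for x in signal]       # quadratic non-linearity
activated = [elu(x) for x in signal]    # generic pointwise activation

print(bandwidth(signal), bandwidth(squared), bandwidth(activated))
```

The squared signal stops exactly at twice the input band limit, while the ELU output has energy beyond it; with a finite number of samples those higher frequencies get aliased, which is where the approximation error in equivariance comes from.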

I hope this is useful!

Best, Gabriele