acerbilab / relational-neural-processes

Practical Equivariances via Relational Conditional Neural Processes (Huang et al., NeurIPS 2023)
MIT License

Permutation equivariance of input feature dimensions #6

Closed lacerbi closed 10 months ago

lacerbi commented 1 year ago

How do we implement permutation equivariance of input features? (see this as an example).

Note that this is not necessarily something we want to implement for all models. When dealing with specific models, the input feature dimensions are not equivariant, since each has a specific meaning. However, we do want this equivariance, e.g., for generic GP models.

I have a couple of ideas for this. I will write them below as two separate comments.

lacerbi commented 1 year ago

Solution 1: Permutation equivariance via explicit summation

Specifically, let $\pi \in \mathcal{S}_D$ be a permutation, where $\mathcal{S}_2 = \{ (1,2), (2,1) \}$, $\mathcal{S}_3 = \{ (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1) \}$, etc. We write $\pi \mathbf{x}$ as the vector where the permutation $\pi$ is applied to the elements of the vector $\mathbf{x}$.
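Just to fix ideas, here is a minimal sketch of one way the explicit summation could be realized: symmetrize an arbitrary CNP-style predictor by averaging its predictions over all $D!$ permutations of the feature dimensions, applied jointly to context and target inputs, so that the prediction is unchanged under any consistent permutation of the input features. Here `cnp_predict` is a hypothetical stand-in for the base model; the obvious drawback is the factorial cost in $D$, and in practice the summation would more likely live inside the encoder/decoder rather than around the whole model.

```python
import itertools
import numpy as np

def symmetrize_over_permutations(cnp_predict, xc, yc, xt):
    """Permutation-invariant prediction by explicit summation over S_D.

    cnp_predict: hypothetical base model, (xc, yc, xt) -> predictive means (M, d_y)
    xc: context inputs (N, D); yc: context outputs (N, d_y); xt: target inputs (M, D)
    """
    D = xc.shape[1]
    means = []
    for perm in itertools.permutations(range(D)):
        p = list(perm)
        # Apply the same feature permutation to context and target inputs.
        means.append(cnp_predict(xc[:, p], yc, xt[:, p]))
    # Averaging the means gives a permutation-invariant prediction; a full
    # predictive distribution could instead be formed as a mixture over permutations.
    return np.mean(means, axis=0)
```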

Encoder

Decoder

Comments

lacerbi commented 1 year ago

Solution 2: Canonical ordering

Encoder

Decoder

Comments

lacerbi commented 1 year ago

Solution 3: Permutation invariance via bi-dimensional deep set

In conclusion, I think that this approach is very appealing, but I am not 100% sure how to apply this in the context of the standard CNP architecture (and ours).

st-- commented 1 year ago

I was wondering why a permutation-equivariant architecture (like in the AHGP paper) might not work. Luigi's intuition is that the way they implement it scrambles the information up too much (dimensions belonging together can only be recovered based on the $y$-value).

One useful set of synthetic test cases might be non-axis-aligned periodic functions in higher dimensions: here, learning the correlations between dimensions (i.e., the direction of the wave) turns the problem into a simple 1D problem, and similar $y$-values repeat often.
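For reference, a quick sketch of how such a synthetic test case could be generated (the particular functional form, ranges, and names are just one possible choice):

```python
import numpy as np

def sample_nonaligned_periodic(n_points, dim, wavelength=1.0, seed=None):
    """Periodic function varying along a random, non-axis-aligned direction w:
    y = sin(2*pi*(w.x)/wavelength). Along w the problem is effectively 1D,
    and similar y-values repeat often across the input space."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    w /= np.linalg.norm(w)                      # direction of the wave
    x = rng.uniform(-2.0, 2.0, size=(n_points, dim))
    y = np.sin(2 * np.pi * (x @ w) / wavelength)
    return x, y[:, None], w
```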

lacerbi commented 1 year ago

Solution 4: Permutation equivariance via relational bi-dimensional deep set

OK, I think I cracked this (at least theoretically, not sure how it is going to work in practice).

I'll first explain how to introduce permutation equivariance (with respect to features/input dimensions) in a standard CNP. This also affords simultaneous training on inputs with different numbers of feature dimensions.

Previous approaches using bi-dimensional deep sets/attention

As a reminder, the current approach, suggested by the AHGP paper and similarly by the neural diffusion process paper, is to use a bi-dimensional deep set / attention mechanism. To my knowledge, none of this has been applied specifically to CNPs, but the application would be a trivial extension (especially given the neural diffusion process paper).

In the following, I denote by xn(i) (in math, $x_n^{(i)}$) the $i$-th element of the input vector $\mathbf{x}_n \in \mathbb{R}^{d_x}$, where $d_x$ is the number of input features, and by yn the output vector $\mathbf{y}_n \in \mathbb{R}^{d_y}$, for $1 \le n \le N$, with $N$ the size of the context set. For simplicity, we can restrict ourselves to the case $d_y = 1$, but there should be no difference for the multi-output case.

Let's put our context set in a table of pairs (the vector yn is repeated for each input dimension):

(x1(1), y1),  (x1(2), y1), ..., (x1(d_x),y1)
(x2(1), y2),  (x2(2), y2), ..., (x2(d_x),y2)
...
(xN(1), yN),  (xN(2), yN), ..., (xN(d_x),yN)

In a nutshell, what previous methods do is first build a permutation-invariant representation for each column of this table. First, we embed each pair (xn(i), yn) into a higher-dimensional vector $\mathbf{z}_{n,i}$, then we aggregate over the data dimension (i.e., over each column). Doing this operation in parallel for each column yields $d_x$ column embeddings $\mathbf{h}_1, \ldots, \mathbf{h}_{d_x}$. Finally, a transformer or a DeepSet is applied to these representations to obtain either an equivariant or an invariant output.
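As a rough sketch of this column-wise construction (a toy embedding network stands in for the learned one, $d_y = 1$ is assumed as above, and all names are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def pair_table(X, Y):
    """Build the table of pairs (xn(i), yn): shape (N, d_x, 1 + d_y),
    with the output vector yn repeated along the feature dimension i."""
    N, d_x = X.shape
    Y_rep = np.repeat(Y[:, None, :], d_x, axis=1)            # (N, d_x, d_y)
    return np.concatenate([X[:, :, None], Y_rep], axis=-1)

# Toy stand-in for the learned pair-embedding network (xn(i), yn) -> z_{n,i}.
d_z = 8
W = rng.normal(size=(2, d_z))   # input size 1 + d_y = 2, i.e. d_y = 1
b = rng.normal(size=d_z)

def column_embeddings(T):
    """Embed each pair, then aggregate (sum) over the data index n separately
    for each column i, yielding h_1, ..., h_{d_x} stacked as a (d_x, d_z) array.
    A DeepSet / transformer across these rows would then produce an invariant /
    equivariant representation over the feature dimensions."""
    Z = np.tanh(T @ W + b)      # (N, d_x, d_z): pair embeddings z_{n,i}
    return Z.sum(axis=0)        # (d_x, d_z): column embeddings h_i
```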

The problem with this approach is that it introduces more invariances than you would want. It is easy to show that you can apply a distinct permutation of the data separately to each input dimension of the representation above and still get the same output. In other words, we have killed the correlation across input features; this approach only preserves the correlation between each feature xn(i) and the outputs yn.

It is true that if the yn are unique in the context set, then in theory it is possible to reconstruct the correlations among the xn(i) using yn as the binding feature. However, this asks the network to do a lot of work in an unnatural way, and it can break in situations like periodic functions (see @st--'s comment above).
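A tiny check of this claim, reusing the `pair_table` / `column_embeddings` sketch above: applying a different data permutation to each column of the pair table leaves every column embedding, and hence anything computed from them, unchanged.

```python
X = rng.uniform(size=(5, 3))    # N = 5 context points, d_x = 3 features
Y = rng.normal(size=(5, 1))     # d_y = 1

T = pair_table(X, Y)            # (5, 3, 2)

# Apply a *different* permutation of the data index n to each column i.
T_scrambled = T.copy()
for i in range(T.shape[1]):
    T_scrambled[:, i] = T[rng.permutation(T.shape[0]), i]

# Identical column embeddings, even though the binding across features is destroyed.
assert np.allclose(column_embeddings(T), column_embeddings(T_scrambled))
```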

The new proposal for CNPs

In short, we want to keep information about each point, and one way to do that is again via a sort of relational encoding of the features.

RCNPs with equivariant inputs

Finally, we would like to apply the procedure above to RCNPs (specifically, for the translational-equivariant case; no need for the isotropic case since it is already equivariant to permutations of input features!). I describe it in a separate comment below, Solution 4 (part b).

manuelhaussmann commented 1 year ago

I'll give it closer thought once my experiments are (finally) running. But on a first reading it sounds like a reasonable approach.

lacerbi commented 1 year ago

Solution 4 (part b): Permutation equivariance via relational bi-dimensional deep set

See Solution 4 above for the first part, about how to implement permutation equivariance of input features in CNPs. Here I discuss the application to RCNPs, in particular for translational invariance.

RCNPs with equivariant inputs

The execution is actually quite simple. Where in the standard CNP we would define the table above:

(x1(1), y1),  (x1(2), y1), ..., (x1(d_x),y1)
(x2(1), y2),  (x2(2), y2), ..., (x2(d_x),y2)
...
(xN(1), yN),  (xN(2), yN), ..., (xN(d_x),yN)

Instead, for the RCNP we have a similar table:

(rho1(1), y1),  (rho1(2), y1), ..., (rho1(d_x),y1)
(rho2(1), y2),  (rho2(2), y2), ..., (rho2(d_x),y2)
...
(rhoN(1), yN),  (rhoN(2), yN), ..., (rhoN(d_x),yN)

where rhon(i), in math $\rho_{n,i}$, is the relational encoding for the $i$-th feature of the $n$-th data point, obtained as: $$\rho_{n,i} \equiv \rho\left(x_n^{(i)}\right) = \bigoplus_{n^\prime=1}^N h_\theta\left( x_{n^\prime}^{(i)} - x_n^{(i)}, \mathbf{y}_{n^\prime} \right),$$ where we used the difference encoding (i.e., the comparison function $g(\cdot, \cdot)$ is the difference, which is suitable to encode translational equivariance), and $h_\theta$ is, as usual, the relational encoding network.

This expression is almost identical to the standard relational encoding, with the only difference that here it is applied separately for each input feature $i$, so every data point $\mathbf{x}_n$ ends up having $d_x$ separate relational encodings (the encoding network is the same for all features, and operates on them in parallel).
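A minimal sketch of this per-feature relational encoding, with a toy network standing in for $h_\theta$, sum as the aggregation $\bigoplus$, and $d_y = 1$ (all names are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the relational encoding network h_theta: (x_{n'}(i) - x_n(i), y_{n'}) -> R^{d_r}.
d_r = 8
W = rng.normal(size=(2, d_r))   # input size 1 + d_y = 2, i.e. d_y = 1
b = rng.normal(size=d_r)
h_theta = lambda pairs: np.tanh(pairs @ W + b)

def relational_feature_encodings(X, Y):
    """Compute rho_{n,i} = sum_{n'} h_theta(x_{n'}(i) - x_n(i), y_{n'}):
    the difference-based relational encoding applied separately to each input
    feature i, with the same network shared across features.
    X: (N, d_x) context inputs; Y: (N, 1) context outputs. Returns (N, d_x, d_r)."""
    N, d_x = X.shape
    # Pairwise feature differences: diffs[n, n', i] = x_{n'}(i) - x_n(i)
    diffs = X[None, :, :] - X[:, None, :]                           # (N, N, d_x)
    Y_rep = np.broadcast_to(Y[None, :, None, :], (N, N, d_x, 1))    # y_{n'} per pair
    pairs = np.concatenate([diffs[..., None], Y_rep], axis=-1)      # (N, N, d_x, 2)
    return h_theta(pairs).sum(axis=1)                               # aggregate over n'
```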

After we obtain the table above, everything proceeds exactly like for the implementation of permutation equivariance for CNPs described in Solution 4 above.