ServiceNow / TACTiS

TACTiS-2: Better, Faster, Simpler Attentional Copulas for Multivariate Time Series, from ServiceNow Research

Proof of Property (2) in Theorem 1 #16

Closed lzb5600 closed 1 year ago

lzb5600 commented 1 year ago

Hi, I found there is an error in the proof of Theorem 1, regarding Property (2).

The inequality between the arithmetic and geometric means is used incorrectly in the claim that "δ ∈ R+ is exactly zero iff the density estimated by the model is permutation invariant". The loss attaining its minimum does not imply that equality holds in that inequality. The claim would only be correct if the arithmetic mean were assumed to stay constant throughout the optimisation. I was wondering if you could help clarify this, in case I am missing something?
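To illustrate what I mean with toy numbers (a minimal sketch, not the actual quantities in the paper), the AM-GM gap between two per-permutation likelihood terms is zero only when the terms are equal, yet nothing in the argument forces the optimiser towards that point:

```python
import numpy as np

# Toy likelihoods of the same sample under two decoding orders (illustrative only).
for a, b in [(0.1, 0.5), (0.3, 0.3)]:
    arithmetic = (a + b) / 2
    geometric = np.sqrt(a * b)
    gap = arithmetic - geometric  # AM-GM gap: zero iff a == b
    print(f"a={a}, b={b}: arithmetic={arithmetic:.4f}, geometric={geometric:.4f}, gap={gap:.4f}")

# a=0.1, b=0.5: arithmetic=0.3000, geometric=0.2236, gap=0.0764
# a=0.3, b=0.3: arithmetic=0.3000, geometric=0.3000, gap=0.0000
```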

aldro61 commented 1 year ago

Hello Zhibin,

Thank you for your interest in our work. After discussing with my co-authors, we do not understand the issue that you raise. Could you please clarify what you mean?

Here are a few key points of the proof:

Kind regards, Alexandre

lzb5600 commented 1 year ago

Hi Alexandre,

Thanks for your reply.

I am concerned about the following point mentioned:

“Since we have enough capacity to learn a permutation invariant function, then necessarily delta=0 at the minimum, equality holds, and we have permutation invariance.”

AFAIK, "permutation invariance" or “delta=0” is not a necessary condition for the loss attaining its minima, i.e. the geometric mean may attain its maximum without the equality holds. Could you elaborate more on why this is a necessary condition?

Cheers, Zhibin

aldro61 commented 1 year ago

Thanks for clarifying. Here's the rationale:

Alex

lzb5600 commented 1 year ago

I think the statement "Suppose that all terms in the mean are maximized, then the only possible way to continue decreasing the value of the objective is to reduce $\delta$" could be problematic. Note that reducing $\delta$ doesn't necessarily lead to a larger mean, and a better loss is not guaranteed when $\delta=0$.

aldro61 commented 1 year ago

I do not see how this is problematic. If you have perfectly fitted all the factorizations, then the only thing that could be done to further reduce the value of the objective is to minimize delta, which is equivalent to achieving permutation invariance (for delta = 0).

If you still believe that there is an issue with the proof, could you please provide a counter-example?

Thanks, Alex

lzb5600 commented 1 year ago

Minimising $\delta$ is not an independent process: it can change the value of the geometric/arithmetic mean. The statement holds if you assume the arithmetic mean stays constant while you keep optimising towards $\delta=0$, which is not guaranteed.

For example, consider two factorization terms $a$ and $b$, with $-(a+b)/2 \leq -\sqrt{ab}$. When $a$ and $b$ are fitted to 0.1 and 0.5, there is no guarantee that further optimisation will reach $a=0.3$ and $b=0.3$, unless you assume $a+b=0.6$ is maintained. Indeed, you may not even be able to optimise further from $a=0.1$ and $b=0.5$.
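As a concrete toy sketch of this concern (my own construction, not the model in the paper): if limited capacity couples $a$ and $b$ so that they cannot be traded one-for-one, the geometric-mean objective can peak at $a \neq b$, and further optimisation from $(0.1, 0.5)$ gets you nowhere:

```python
import numpy as np

# Hypothetical capacity constraint (toy construction): the model can only realise
# pairs (a, b) along this one-parameter family, so raising a forces b to drop
# five times as fast -- a and b are not free to trade one-for-one.
t = np.linspace(0.0, 0.1, 1001)
a = 0.1 + t
b = 0.5 - 5.0 * t

geometric = np.sqrt(a * b)   # the geometric-mean objective along the family
best = np.argmax(geometric)
print(a[best], b[best], geometric[best])
# Prints 0.1 0.5 0.2236...: the constrained optimum keeps a != b, so delta > 0.
```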

marcotet commented 1 year ago

You are correct that minimizing $\delta$ is not an independent process. However, it is useful to look at what happens when you are at the global optimum of the objective function.

Let's assume that we use the maximum likelihood method. Let's say the correct PDF is $p(x)$, and we write our forecast as the arithmetic mean of two functions: $f(x) = (a(x) + b(x)) / 2$. Then when we reach the maximum likelihood, we have $f(x) = p(x)$. However, $a(x)$ and $b(x)$ can be anything, as long as their sum is constant. So we don't have permutation invariance with this function.

However, what happens if our objective uses the geometric mean of the two functions? Then we have $g(x) = \sqrt{a(x) b(x)}$. If we compute the expected likelihood with this distribution, it will always be worse than the one for $f(x)$ wherever $a(x) \neq b(x)$. Therefore, if we are to reach the maximum likelihood with the geometric mean, we need $a(x) = b(x)$, and we thus have permutation invariance.

Does this clarify the reasoning behind the theorem? Your point about $a + b$ being constant is quite important since it is actually the case when you reach the correct forecast.
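Here is a small numeric sketch of that argument (toy numbers, not the actual model): pick components whose arithmetic mean matches the true density but which differ pointwise; the geometric mean then has strictly worse expected log-likelihood, and the gap only closes when $a = b = p$.

```python
import numpy as np

# Toy discrete ground truth over three outcomes (illustrative numbers only).
p = np.array([0.2, 0.3, 0.5])

# Two components whose arithmetic mean equals p, but which differ pointwise.
a = np.array([0.1, 0.4, 0.5])
b = np.array([0.3, 0.2, 0.5])
assert np.allclose((a + b) / 2, p)

f = (a + b) / 2      # arithmetic-mean forecast: equals p, so maximal likelihood
g = np.sqrt(a * b)   # geometric-mean forecast: loses mass wherever a != b

expected_ll = lambda q: np.sum(p * np.log(q))
print(expected_ll(f))   # -1.0297  (optimal: matches the ground truth)
print(expected_ll(g))   # -1.0761  (strictly worse)
print(expected_ll(p))   # -1.0297  (setting a = b = p closes the gap)
```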

lzb5600 commented 1 year ago

Hi marcotet, thanks a lot for your reply.

The problem here is that we are not guaranteed that $a(x) = b(x)$. The fact that they sum to a given number at a certain point is different from constraining them to sum to that number during optimisation.

"However, $a(x)$ and $b(x)$ can be anything, as long as their sum is constant." I politely disagree with this. $a(x)$ and $b(x)$ are not free to change while still keeping their sum being constant. For example, we could have optimum at $a(x)=2$, $b(x)=8$, but the best we could do under the constrain $a(x)=b(x)$ may be $a(x)=3$, $b(x)=3$. In other words, because we are not assuming derivatives of $a(x)$ and $b(x)$ are inverse, or their slops are inverse to each other, increasing $a(x)$ by 1 can decrease $b(x)$ by 5.

marcotet commented 1 year ago

The theorem requires quite strong hypotheses to work. Amongst them is that the model has infinite capacity, so it can effectively model any distribution. Another is that it is able to reach the global minimum of its loss function.

Using the simplified model with $a(x)$ and $b(x)$, we can limit ourselves to studying the case where $a(x) + b(x)$ is constant, since this constraint is equivalent to saying that we are at the global minimum of the arithmetic version of the loss function (the negative log-likelihood of the distribution). This minimum is a lower bound for the geometric version of the loss function (the one we use in our model). So if there is an $a(x)$, $b(x)$ pair for which the arithmetic and geometric versions are equal, then it is guaranteed to be the global minimum of the geometric version.

From the infinite-capacity hypothesis, we can have $a(x) = b(x)$ on that constraint. Then, from the inequality between the arithmetic and geometric mean, we can prove that there are no global minima without permutation invariance.
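Written out, the constrained step is just the AM-GM inequality: for $a, b \geq 0$ with $a + b = c$ fixed,

$$\sqrt{ab} \;\leq\; \frac{a+b}{2} \;=\; \frac{c}{2},$$

with equality if and only if $a = b = c/2$. So, restricted to the set where the arithmetic version of the loss is already at its global minimum, the geometric version is minimized only at the permutation-invariant solution.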

If you are unconvinced, we should probably continue this discussion verbally, using a Zoom call with a whiteboard, so please send me an email if you want to set a meeting to do so.

lzb5600 commented 1 year ago

The claim of infinite capacity to model any "distribution" is a contradiction. Permutation invariance is a prerequisite for modelling "distributions" in the first place (i.e. a valid copula in this case), and if you allow such "infinite capacity" to model any distribution, then you have already assumed permutation invariance without proof.

I have sent you an email and let's find a time to clarify this.

Thanks, Zhibin

marcotet commented 1 year ago

Thanks Zhibin for taking the time to meet on Zoom. While we couldn't iron out everything, I believe we now have a much better understanding of each other's arguments.

One important issue raised is that the paper did not prove that, in the infinite-capacity limit, our attentional copula can encode independent distributions for each permutation. This is important because if it can, then it is straightforward to prove that at the global optimum it reaches the ground truth for any distribution, which is then trivially permutation invariant.

To prove that the infinite-capacity version of the model can encode independent distributions for each permutation, we first need to realize that we don't actually need to detect the full permutation. Since we decode one token at a time, we only need the transformer to detect which tokens have already been decoded. This should be doable by having a constant key for all prediction tokens, and a unique value embedding for each of them. Using this result, the model would then be able to adapt its output uniquely to the set of tokens that have already been decoded. Therefore, we should be able to prove that, in that limit, the architecture can create a model which reaches the desired ground-truth distribution for all permutations, and which would then be permutation invariant and a proper copula.
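As a toy sketch of that construction (an illustration of the idea above, not code from this repository): if every already-decoded prediction token shares an identical key while keeping its own value embedding, the attention weights over the decoded tokens are uniform, so the attention output depends only on the set of decoded tokens, not on the order in which they were decoded.

```python
import torch

def decoded_set_summary(decoded_values: torch.Tensor) -> torch.Tensor:
    """Toy single-head attention over already-decoded tokens with a constant key.

    decoded_values: (num_decoded, d_model), one unique value embedding per token.
    Identical keys give uniform softmax weights, so the output is the mean of the
    decoded value embeddings -- a set function, invariant to the decoding order.
    """
    num_decoded = decoded_values.shape[0]
    keys = torch.ones(num_decoded, 4)        # constant key shared by all decoded tokens
    query = torch.ones(1, 4)                 # any fixed query yields identical scores
    scores = query @ keys.T / 2.0            # (1, num_decoded), all entries equal
    weights = torch.softmax(scores, dim=-1)  # uniform attention weights
    return weights @ decoded_values          # mean over the decoded set

values = torch.randn(3, 8)                   # unique value embeddings of 3 decoded tokens
permutation = torch.tensor([2, 0, 1])
assert torch.allclose(decoded_set_summary(values),
                      decoded_set_summary(values[permutation]), atol=1e-6)
```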