cemanil / LNets

Lipschitz Neural Networks described in "Sorting Out Lipschitz Function Approximation" (ICML 2019).

GroupSort and MaxMin do not give same behavior for groupsize=2 #8

Closed jk2569 closed 4 years ago

jk2569 commented 4 years ago

Given an input of size (B, d), GroupSort(d//2) and MaxMin(d//2) should behave the same way. However, a simple test shows that they do not produce the same results. In fact, in the code for MaxMin, the forward function simply concatenates the maxes and the mins, so the output does not have the interleaved max-min-max-... structure that GroupSort produces. I'm a bit confused because I was expecting them to have the same behavior (with GroupSort being the correct implementation).

EDIT: Perhaps it is okay after all, since the output of MaxMin is simply a permutation of the output of GroupSort for a group size of 2. If a linear layer follows, this just amounts to permuting that layer's weight matrix along its input dimension. Could someone confirm this?
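For reference, the simple test I have in mind looks something like this (a minimal, self-contained sketch that mimics the two activations rather than importing the repo's classes, so the exact concatenation/interleaving conventions here are my assumptions):

```python
import torch

def maxmin_like(x, num_units):
    # Mimics the MaxMin forward described above: all pairwise maxes,
    # then all pairwise mins, concatenated along the feature dimension.
    a, b = x.view(x.shape[0], num_units, 2).unbind(dim=-1)
    return torch.cat([torch.max(a, b), torch.min(a, b)], dim=-1)

def groupsort_like(x, num_units):
    # Mimics GroupSort with group size 2: sort each consecutive pair,
    # keeping the interleaved max-min-max-min layout.
    pairs = x.view(x.shape[0], num_units, 2)
    return pairs.sort(dim=-1, descending=True).values.reshape(x.shape[0], -1)

x = torch.randn(4, 6)                       # (B, d) with d = 6, group size 2
out_mm = maxmin_like(x, 3)
out_gs = groupsort_like(x, 3)
print(torch.equal(out_mm, out_gs))          # False: features appear in a different order
print(torch.equal(out_mm.sort(-1).values,   # True: each row contains the same values,
                  out_gs.sort(-1).values))  # i.e. one output is a permutation of the other
```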

cemanil commented 4 years ago

Hi,

Thank you for your question. Yes, the observation in your edit is correct: as long as you're feeding the result of MaxMin or GroupSort into a linear layer, the specific order in which the maxes and mins appear won't matter, so feel free to choose whichever implementation you like.
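To make that concrete, here is a minimal sketch (not code from this repo): `x` stands in for one activation's output and `x[:, perm]` for the other's, and the permutation is absorbed by reordering the next layer's weight matrix along its input dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, d, out = 4, 6, 5
x = torch.randn(B, d)          # stands in for GroupSort's output
perm = torch.randperm(d)       # some fixed reordering of the features
x_permuted = x[:, perm]        # stands in for MaxMin's output

linear = nn.Linear(d, out)
permuted_linear = nn.Linear(d, out)
with torch.no_grad():
    permuted_linear.bias.copy_(linear.bias)
    # Absorb the permutation into the input dimension of the weight matrix.
    permuted_linear.weight.copy_(linear.weight[:, perm])

# The two (activation, linear) pairs compute exactly the same function.
print(torch.allclose(linear(x), permuted_linear(x_permuted)))  # True
```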

Cem

jk2569 commented 4 years ago

Thanks! I have another question. With regard to convolution layers, your group has a follow-up paper, BCOP. However, this repo contains a Bjorck orthogonalization implementation for convolution layers. Do you recommend using one over the other, and if so, why?

cemanil commented 4 years ago

Definitely use the methods from our follow-up work! We identify and remedy several issues that arise when Lipschitz convolution operations are implemented naively.

jk2569 commented 4 years ago

Will do! Could you clarify which problems with the naive Lipschitz convolution implementation you mean? That is, what is wrong with the convolutions implemented here, and what does BCOP correct?

cemanil commented 4 years ago

There are a number of things that the new Lipschitz convolution operation fixes. One of the important contributions is that the BCOP layer successfully parametrizes orthogonal convolutions, yielding layers that always preserve gradient norm, whereas the convolutions implemented here, although strictly Lipschitz, can lose gradient norm. We include comparisons in the follow-up paper. You might also be interested in our analysis of the surprising difficulties that arise when optimizing over orthogonal convolutions, which we weren't aware of when writing our first paper.
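To illustrate the gradient-norm point in isolation, here is a toy sketch with plain matrices (not the actual convolution parametrizations): an orthogonal map preserves the norm of the backpropagated gradient exactly, whereas a map that is merely 1-Lipschitz typically shrinks it.

```python
import torch

torch.manual_seed(0)
d = 16

# Orthogonal matrix: all singular values are 1, so norms are preserved.
orthogonal = torch.linalg.qr(torch.randn(d, d)).Q
# Merely 1-Lipschitz matrix: spectral norm is 1, but the other singular
# values are smaller, so most gradient directions get shrunk.
lipschitz = torch.randn(d, d)
lipschitz = lipschitz / torch.linalg.matrix_norm(lipschitz, ord=2)

for name, W in [("orthogonal", orthogonal), ("1-Lipschitz", lipschitz)]:
    x = torch.randn(d, requires_grad=True)
    y = W @ x
    # Backpropagate a unit-norm gradient and see how much of it survives.
    g = torch.randn(d)
    y.backward(g / g.norm())
    print(f"{name}: gradient norm at the input = {x.grad.norm().item():.3f}")
```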

jk2569 commented 4 years ago

Awesome. Thank you!