Closed by jk2569 4 years ago
Hi,
Thank you for your question. Yes, the comment in your edit makes sense: as long as you feed the result of MaxMin or GroupSort to a linear layer, the order in which the mins and maxes appear won't matter. Feel free to choose whichever implementation you like.
Cem
Thanks! I have another question. With regard to convolution layers, your group had a follow-up paper, BCOP. However, this repo also has a Bjorck orthogonalization implementation for convolution layers. Do you recommend using one over the other, and if so, why?
Definitely use the methods from our follow-up work! There we identify and remedy some of the issues that arise from naively implementing Lipschitz convolution operations.
Will do! Could you clarify which problems with the naive Lipschitz convolution implementation you are talking about? As in, what's wrong with the convolution done here and what does BCOP correct?
There are a number of things that the new Lipschitz convolution operation fixes, but I think one of the important contributions is that the BCOP layer successfully parametrizes orthogonal convolutions and leads to layers that always preserve gradient norm, whereas the convolutions implemented here (although strictly Lipschitz) might lose gradient norm. We have some comparisons in the follow-up paper. You might also be interested in our analysis of the surprising difficulties that arise when one tries to optimize over orthogonal convolutions, which we weren't aware of when writing our first paper.
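To make the gradient-norm point concrete, here is a minimal sketch using small dense matrices as a stand-in for convolutions. It is only an analogy under stated assumptions, not BCOP or the code in this repo: `Q` and `C` are hypothetical 2x2 examples, and the backward pass through `y = W x` is taken to be `W^T g`. An orthogonal matrix leaves the gradient norm unchanged, while a matrix that is 1-Lipschitz but not orthogonal can shrink it.

```python
import math

def backprop_through_linear(W, g):
    """Return W^T g, the gradient passed back through y = W x."""
    rows, cols = len(W), len(W[0])
    return [sum(W[i][j] * g[i] for i in range(rows)) for j in range(cols)]

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(t * t for t in v))

# Hypothetical orthogonal layer: a 2D rotation (all singular values equal 1).
theta = 0.7
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

# Hypothetical 1-Lipschitz but non-orthogonal layer: singular values 1 and 0.5.
C = [[1.0, 0.0],
     [0.0, 0.5]]

g = [0.6, 0.8]  # upstream gradient with unit norm

norm_Q = norm(backprop_through_linear(Q, g))  # stays at the norm of g
norm_C = norm(backprop_through_linear(C, g))  # strictly smaller than the norm of g

print(norm(g), norm_Q, norm_C)
```

Stacking many layers like `C` would compound the shrinkage, which is one way to read the remark above that layers which are merely Lipschitz "might lose gradient norm".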
Awesome. Thank you!
Given an input of size (B, d), the behavior of GroupSort(d//2) and MaxMin(d//2) should be the same. However, a simple test shows that they do not produce the same results. In fact, in the code for MaxMin, the forward function simply concatenates the maxes and mins, so its output lacks the interleaved max-min-max-... structure that GroupSort produces. I'm a bit confused, because I expected them to behave identically (with GroupSort being the correct implementation).
EDIT: Perhaps it might be okay, since the output of MaxMin is simply a permutation of the output of GroupSort for a group size of 2. If a linear layer follows, this is essentially just permuting the rows of the weight matrix. Could someone confirm this?
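The permutation argument in the EDIT can be checked with a small sketch. The two functions below are hypothetical stand-ins written in plain Python, not the repo's `GroupSort` or `MaxMin`; they assume both activations pair adjacent entries `(x[2i], x[2i+1])` and that the interleaved form puts the max first, as described above. Whether the permutation acts on the rows or the columns of the weight matrix depends on how the weights are stored: with an `(out_features, in_features)` layout, as in `torch.nn.Linear`, it is the columns.

```python
import math

def group_sort_pairs(x):
    """Interleaved output: [max, min, max, min, ...] over adjacent pairs."""
    out = []
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        out += [max(a, b), min(a, b)]
    return out

def max_min_concat(x):
    """Concatenated output: all pairwise maxes, then all pairwise mins."""
    pairs = [(x[i], x[i + 1]) for i in range(0, len(x), 2)]
    return [max(p) for p in pairs] + [min(p) for p in pairs]

def linear(W, x):
    """y = W x with W stored as (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

x = [3.0, -1.0, 0.5, 2.0, -4.0, -2.0]
gs = group_sort_pairs(x)
mm = max_min_concat(x)

# perm[j] is the position in gs of the j-th entry of mm:
# even positions (the maxes) first, then odd positions (the mins).
n = len(x)
perm = list(range(0, n, 2)) + list(range(1, n, 2))
assert [gs[k] for k in perm] == mm

# Arbitrary example weights; permuting their input dimension the same way
# makes the linear layer on mm match the original layer on gs.
W = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     [-1.0, 0.5, 2.0, -3.0, 1.5, 0.25]]
W_perm = [[row[k] for k in perm] for row in W]

y_gs = linear(W, gs)
y_mm = linear(W_perm, mm)
assert all(math.isclose(a, b) for a, b in zip(y_gs, y_mm))
```

Since any weight matrix reachable with one ordering is reachable with the other by reindexing, a trained network has the same expressive power either way; only the indexing of the learned weights differs.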