brdav / atrc

Exploring Relational Context for Multi-Task Dense Prediction [ICCV 2021]
MIT License

Why single-level optimization? #6

Closed KatsarosEf closed 1 year ago

KatsarosEf commented 1 year ago

Hello and congrats on your great work! I would like to ask whether you have by any chance tried bi-level optimization for your proposed "learning-to-attend" scheme? From my understanding (please correct me if I am wrong), you converge to different types of attention in each run, and you select the final inter-task attention type by majority voting across five runs (i.e. 3x global, 1x local, 1x S-label). Would bi-level instead of single-level optimization remedy this issue? Do you have any insights on that yourself?
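
Just to make sure I have understood the selection step, here is how I picture the majority vote in code (the run outcomes below are only the example counts from above, not numbers I have verified):

```python
from collections import Counter

# Hypothetical per-run choices for one task pair: the "3x global, 1x local, 1x S-label" case
run_choices = ["global", "global", "local", "s-label", "global"]

# The attention type picked most often across the five runs is the one that gets kept
selected, count = Counter(run_choices).most_common(1)[0]
print(selected, count)  # -> global 3
```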

brdav commented 1 year ago

Hi, you are right that we use majority voting over five runs. For the optimization we mostly followed the NAS strategy of SNAS (Xie et al.), which uses single-level optimization and Gumbel-Softmax. Due to the Gumbel-Softmax sampling noise, the architecture estimates have some variance across runs, but I'm not sure how bi-level optimization would help to reduce this. I have not tried it, though.
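
For anyone finding this later, here is a toy sketch of what single-level optimization with Gumbel-Softmax looks like: architecture logits and network weights are trained jointly with one loss and one optimizer. The candidate ops, shapes and hyper-parameters below are placeholders, not the actual ATRC code:

```python
import torch
import torch.nn.functional as F

# Placeholder candidate ops for one task pair (stand-ins, not the ATRC context modules)
candidate_ops = torch.nn.ModuleList([
    torch.nn.Conv2d(8, 8, kernel_size=1),             # stand-in for "global"
    torch.nn.Conv2d(8, 8, kernel_size=3, padding=1),   # stand-in for "local"
    torch.nn.Identity(),                               # stand-in for "none"
])
alpha = torch.nn.Parameter(torch.zeros(len(candidate_ops)))  # architecture logits

# Single-level: one optimizer updates weights and architecture logits together
optimizer = torch.optim.Adam(list(candidate_ops.parameters()) + [alpha], lr=1e-3)

x = torch.randn(2, 8, 16, 16)
target = torch.randn(2, 8, 16, 16)

for step in range(100):
    # Sample a relaxed one-hot over candidates (SNAS-style stochastic relaxation)
    gates = F.gumbel_softmax(alpha, tau=1.0, hard=False)
    out = sum(g * op(x) for g, op in zip(gates, candidate_ops))
    loss = F.mse_loss(out, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The final architecture estimate is the argmax of the logits
print(alpha.argmax().item())
```

This is of course just a toy version; the point is only that there is a single loss and a single optimizer, and that the Gumbel noise is where the run-to-run variance comes from.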

KatsarosEf commented 1 year ago

In my understanding, bi-level optimization helps you explore the joint hyper-parameter space better. DARTS (deterministic NAS) experiments with single-level optimization, and the results are significantly worse than with their bi-level setup for both the first- and second-order approaches. SNAS (stochastic NAS) shows that they do not need bi-level optimization to outperform DARTS, but they state: "It is interesting to note that with same single-level optimization, SNAS significantly outperforms DARTS. Bilevel optimization could be regarded as a data-driven meta-learning method to resolve the bias proved above". AdaShare (Sun et al.) and a few follow-up papers use bi-level optimization on Gumbel-Softmax variables. In any case, thanks for your reply; I hope I'll be able to shed some light on this.
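
For comparison, this is roughly the first-order bi-level alternation I have in mind, in the spirit of DARTS/AdaShare: weights are updated on a training split, architecture logits on a held-out validation split. Again, this is only a toy sketch with placeholder modules, not anyone's actual code:

```python
import torch
import torch.nn.functional as F

# Same toy candidate ops as in the single-level sketch above (placeholders)
candidate_ops = torch.nn.ModuleList([
    torch.nn.Conv2d(8, 8, kernel_size=1),
    torch.nn.Conv2d(8, 8, kernel_size=3, padding=1),
    torch.nn.Identity(),
])
alpha = torch.nn.Parameter(torch.zeros(len(candidate_ops)))

w_optimizer = torch.optim.SGD(candidate_ops.parameters(), lr=1e-2)  # inner level: network weights
a_optimizer = torch.optim.Adam([alpha], lr=3e-4)                    # outer level: architecture logits

def forward(x):
    gates = F.gumbel_softmax(alpha, tau=1.0, hard=False)
    return sum(g * op(x) for g, op in zip(gates, candidate_ops))

x_train, y_train = torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16)
x_val, y_val = torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16)

for step in range(100):
    # 1) Update architecture logits on the validation loss (first-order approximation)
    a_optimizer.zero_grad()
    F.mse_loss(forward(x_val), y_val).backward()
    a_optimizer.step()

    # 2) Update network weights on the training loss
    w_optimizer.zero_grad()
    F.mse_loss(forward(x_train), y_train).backward()
    w_optimizer.step()

print(alpha.argmax().item())
```

The second-order DARTS variant additionally differentiates through a virtual weight step, but as far as I can tell the first-order alternation above is what most follow-ups use in practice.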