distillpub / post--feature-wise-transformations

Feature-Wise Transformations
https://distill.pub/2018/feature-wise-transformations
Creative Commons Attribution 4.0 International

How to discuss feature-wise multiplicative vs. additive interactions #41

Closed ethanjperez closed 6 years ago

ethanjperez commented 6 years ago

In the "Attention over features" section, it's stated that "Both papers [on gated-attention] show improvement over conditioning via concatenation." Since we bring up both feature-wise biasing and feature-wise scaling throughout the article, it's worth deciding how we want to frame this comparison.

Should we discuss performance comparisons between concatenation/feature-wise biasing and feature-wise multiplication/gating? That comparison isn't really the point of our article, and these distinctions might distract readers; we don't want to reinforce that multiplicative vs. additive interactions is the main difference (which isn't what the FiLM paper found), but rather that the feature-wise aspect is important.

I'd be in favor of replacing sentences like "Both papers [on gated-attention] show improvement over conditioning via concatenation." with something more specific about how the feature-wise aspect helps learning. E.g.: "\cite{Gated-Attention Architectures for Task-Oriented Language Grounding} shows that feature-wise modulation enables agents to learn to follow language instructions in reinforcement learning by modulating which features in the agent's visual pipeline/convolutional neural network will be important for its downstream instruction-following policy network. Only features relevant to the particular categories of object types referenced by the language instruction are significantly activated."
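To make concrete what "modulating which features will be important" means, here is a minimal NumPy sketch of feature-wise gating: one sigmoid gate per channel, predicted from a language embedding, multiplied into the conv feature maps. All shapes, names, and the random weights are illustrative, not taken from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(64, 7, 7))   # conv output: (channels, H, W)
lang_embedding = rng.normal(size=(128,))     # instruction encoding
W = rng.normal(size=(64, 128)) * 0.1         # learned projection (random here)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One gate value per channel, broadcast over spatial locations:
# channels irrelevant to the instruction are pushed toward zero.
gates = sigmoid(W @ lang_embedding)          # (64,)
gated = feature_maps * gates[:, None, None]  # feature-wise scaling

print(gated.shape)  # (64, 7, 7)
```

The policy network downstream then only sees the channels the instruction "turned on", which is the mechanism the proposed sentence describes.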

I think this is my fault (I must have included this information in the literature review), but I only noticed it while reading over the draft.

harm-devries commented 6 years ago

we don't want to reinforce that multiplicative vs. additive interactions is the main difference (which isn't what the FiLM paper found), but rather that the feature-wise aspect is important.

I think what's missing is an explanation that concatenation is a form of feature-wise biasing, so we can point out that previous work found that multiplicative interactions perform better than additive interactions. We can then add that the FiLM paper found that having both multiplicative and additive interactions leads to the best performance.
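The claim that concatenation is a form of feature-wise biasing can be checked in a few lines: concatenating a conditioning vector and applying a linear layer is algebraically identical to the unconditioned projection plus a conditioning-dependent, input-independent bias. A minimal NumPy sketch (shapes and names are illustrative; in FiLM, gamma and beta would be predicted from the conditioning vector rather than sampled):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # input features
z = rng.normal(size=(3,))        # conditioning vector
Wx = rng.normal(size=(5, 4))
Wz = rng.normal(size=(5, 3))

# Concatenation followed by a linear layer...
concat_out = np.concatenate([Wx, Wz], axis=1) @ np.concatenate([x, z])
# ...equals the plain projection plus a feature-wise bias computed from z.
bias_out = Wx @ x + Wz @ z
print(np.allclose(concat_out, bias_out))  # True

# FiLM combines both interaction types: feature-wise scaling and biasing.
gamma = rng.normal(size=(5,))    # multiplicative term (from z in practice)
beta = rng.normal(size=(5,))     # additive term (from z in practice)
film_out = gamma * (Wx @ x) + beta
```

This makes the comparison in the related work concrete: concatenation gives you only the additive (bias) term, gating gives you only the multiplicative term, and FiLM's gamma/beta parameterization includes both.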

Btw, what do you mean by the feature-wise aspect is important?

vdumoulin commented 6 years ago

I think the recent reorganization of the related work section renders this issue obsolete. I'll go ahead and close it but feel free to reopen it if you disagree.