Archy's feedback - GitHub issues

[ ] (Archy) I found the maths in [the bilinear transformation] section a bit confusing, probably because I'm not very good at maths, but also because of the number of different nomenclatures flying around. Some visualizations will hopefully help (and look like they're in the pipeline).
[ ] We should discuss the connection to highway networks, which use sigmoidal gating (as LSTMs do) in the layers of a very deep network.
[x] The LSTM and PixelCNN self-conditioning explanations are fragmented and list-like.
[x] The phrase "Going backwards from the definition of each computation mechanism, we will now explain how they can be expressed in terms of generalized bilinear transformations." is a bit jarring.
[x] Add a supporting example to "It could also be that an image needs to be processed in the context of a question being asked."
[x] "Attention is a good example of one such principled approach: side information is used [...]" -> s/side/contextual
[x] Add a supporting example to "For instance, in conditional decoder-based generative models, we would like to map a source of noise to model samples in a way that is class-aware."
[x] The use of both feature-wise affine transformations and feature-wise linear modulation is confusing.
[x] (Archy) I think [starting with concatenation] is very useful, i.e. "Let's think of the dumbest solution possible and examine why it doesn't work." I'd be tempted to move this further up the article, to give a simple concrete example of how you might incorporate contextual details. You can then explain the shortcomings and invoke FiLM as the solution. This will also help clear up some ambiguity for the naive reader as to whether you're considering this 'concatenate and forget' solution as an example of FiLM or not (it's not, right, because there is no modulation of existing features, you just add a bunch more?).
[x] (Archy) I see we're working towards the definition of a FiLM that you gave earlier -- composed of a biasing and a scaling -- but that progress is not particularly limpid. Could you restructure to make it clear that you're talking about particular parts of the FiLM definition i.e. have subtitles like "+B(Z)" or "Y(x) . x" or something like that? It'd be helpful to more clearly embed the literature review in an exploration of the equation.
[x] "Several variants of FiLM can be found in the literature." -> We need to make it clear that we've been discussing PARTS of the FiLM formulation, and now we're discussing models that use all of the bells and whistles.
[x] "So far, the distinction between the FiLM generator and the FiLM-ed network has been rather clear, but it is not strictly necessary." -> To make it even more clear, we could spell it out here: "We've had one network which outputs parameters for the transformation, and these are applied to the layers of a second network."
[x] "By feature-wise, we mean that scaling and shifting are applied element-wise, or in the case of convolutional networks, feature map-wise." -> (Archy says) This is a little confusing. Element-wise is a mathematical statement, but feature map-wise is a statement about what that unit of the network represents. We should explain the level of granularity in convolutional neural networks vs. fully-connected networks. In CNNs, a feature map is the same feature observed at different spatial locations.
[x] Define gamma and beta before introducing them in the first equation of the article.
[x] When discussing multiplicative interactions, it would be helpful to introduce the term 'conditional scaling' at this point rather than later, and to clarify the relationship to conditional biasing.
[x] Add a "spoiler" sentence connecting CBN and FiLM when introducing CBN.
[x] Merge the sentence "We can also use FiLM layers to condition a style transfer network on a chosen style image." into the next paragraph.
[x] Should there be a sub-heading for self-conditioned models?
[x] (Archy) Can you say something about how squeeze-and-excitation differs from the norm? i.e. all layers are conditioned on previous layers in a vanilla NN... I think this description is a little underspecified. Isn't SE more about allowing between-channel interactions than between-layer interactions?
[x] Beef up the conclusion. What is it that our new formulation brings to the table? Why is thinking of these different things within a single family of bilinear transformations a useful exercise? Are there open questions we would like to see answered?
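
A note on the "feature-wise" granularity item above: a small sketch might make the distinction concrete. It assumes the usual FiLM form (scale by gamma, shift by beta), a channels-first layout for convolutional activations, and random values standing in for the output of the FiLM generator. The function name `film` and all shapes are hypothetical, chosen only for illustration.

```python
import numpy as np


def film(x, gamma, beta):
    """Apply gamma * x + beta with one (gamma, beta) pair per feature.

    x:           (batch, features) for a fully-connected layer, or
                 (batch, channels, height, width) for a conv layer
                 (channels-first layout is an assumption of this sketch).
    gamma, beta: (batch, features) or (batch, channels).
    """
    # Append singleton axes so each gamma/beta value broadcasts over
    # every spatial location of its feature map.
    shape = gamma.shape + (1,) * (x.ndim - gamma.ndim)
    return gamma.reshape(shape) * x + beta.reshape(shape)


rng = np.random.default_rng(0)

# Stand-ins for the FiLM generator's output: one pair per example and feature.
gamma = rng.normal(size=(2, 3))
beta = rng.normal(size=(2, 3))

# Fully-connected case: one scalar per feature, so the operation is element-wise.
fc_in = rng.normal(size=(2, 3))
fc_out = film(fc_in, gamma, beta)

# Convolutional case: the same pair is shared by all 4x4 locations of a feature map.
conv_in = rng.normal(size=(2, 3, 4, 4))
conv_out = film(conv_in, gamma, beta)
```

In the fully-connected case each feature is a single number, so "feature-wise" and "element-wise" coincide. In the convolutional case a feature is a whole map, so one scale and one shift are reused at every spatial location.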

distillpub / post--feature-wise-transformations

Archy's feedback #65