Thank you so much for pointing that out!
For the modulated convolution: it's an optional component and isn't used by default (e.g., the CLEVR model doesn't use it). Second, we consider the modulation to be part of the attention layer, which consists of both regional attention (where each latent affects some region of the image) and global attention (where one additional global latent attends to and modulates all the features, allowing for global visual effects over the image such as style or lighting conditions). We will update the paper text to reflect this extension.
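For intuition, here is a minimal NumPy sketch of that idea (not the repository's actual implementation; the shapes and names like `regional_latents` and `global_latent` are purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative shapes: n regional latents, one global latent, m feature positions.
n, m, d = 16, 64, 32
regional_latents = np.random.randn(n, d)   # each tends to affect some image region
global_latent = np.random.randn(d)         # attends to / modulates all positions
features = np.random.randn(m, d)           # flattened feature-map positions

# Regional attention: each position attends to the latents, and the
# aggregated latent signal multiplicatively modulates its features.
attn = softmax(features @ regional_latents.T / np.sqrt(d))  # (m, n)
regional_mod = attn @ regional_latents                      # (m, d)
features = features * (1.0 + regional_mod)

# Global modulation: one latent scales every position uniformly,
# allowing image-wide effects such as style or lighting changes.
features = features * (1.0 + global_latent)
```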
For the *2 part, you're totally correct; that's a mistake in the figure. The whole block repeats twice, rather than each operation repeating independently (I meant to validate the figure against the code before uploading the paper to arXiv, but haven't gotten a chance yet because of NeurIPS). I'll update the figure shortly to be consistent with the code and model.
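To make the intended repetition concrete, here is a hedged sketch assuming the block order described in the question (conv -> attention -> noise, repeated twice, followed by a residual conv); all the layer functions are stand-ins, not the repository's code:

```python
import numpy as np

# Purely illustrative stand-ins for the real layers; each returns a tensor
# of the right shape so the control flow can be run as-is.
def conv(x):       return x                               # conv0_up / conv1 stand-in
def attention(x):  return x                               # bipartite attention stand-in
def add_noise(x):  return x + 0.01 * np.random.randn(*x.shape)

def synthesis_block(x):
    # The "*2" applies to this whole (conv -> attention -> noise) unit,
    # not to each operation separately: once with the upsampling conv
    # (conv0_up) and once with the plain conv (conv1).
    for _ in range(2):
        x = conv(x)
        x = attention(x)
        x = add_noise(x)
    return x + conv(x)  # residual connection via the resnet conv

x = np.random.randn(4, 4, 32)
print(synthesis_block(x).shape)
```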
Hi,
Thank you for your great work.
I just want to mention that the order of layers in the code seems different from the figure you shared. Shouldn't there be a modulated convolution layer before and after the attention layer? Based on the code, a synthesizer block consists of a conv0_up layer + attention + noise addition + conv1 + attention + noise addition + a ResNet conv.
So why does the figure show an (up_sampling x 2) and 2 x resnet convolutions after the attention layer?