Thank you so much for pointing that out!
For the modulated convolution: it's an optional component and isn't used by default (e.g., the CLEVR model doesn't use it). Second, we consider the modulation to be part of the attention layer, which consists of both regional attention (where each latent affects some region of the image) and global attention (where one additional global latent attends to and modulates all the features, allowing for global visual effects over the image such as style or lighting conditions). We will update the paper text to reflect this extension.
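For intuition, here is a minimal NumPy sketch of that idea (not the repository's actual implementation; the shapes and names like `regional_latents` and `global_latent` are purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative shapes: n regional latents, one global latent, m feature positions.
n, m, d = 16, 64, 32
regional_latents = np.random.randn(n, d)   # each tends to affect some image region
global_latent = np.random.randn(d)         # attends to / modulates all positions
features = np.random.randn(m, d)           # flattened feature-map positions

# Regional attention: each position attends to the latents, and the
# aggregated latent signal multiplicatively modulates its features.
attn = softmax(features @ regional_latents.T / np.sqrt(d))  # (m, n)
regional_mod = attn @ regional_latents                      # (m, d)
features = features * (1.0 + regional_mod)

# Global modulation: one latent scales every position uniformly,
# allowing image-wide effects such as style or lighting changes.
features = features * (1.0 + global_latent)
```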
For the *2 part, you're totally correct; that's a mistake in the figure. The whole block repeats twice, rather than each operation repeating independently (I meant to validate the figure against the code before uploading the paper to arXiv, but haven't gotten a chance yet because of NeurIPS). I'll update the figure shortly to be consistent with the code and model.
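To make the intended repetition concrete, here is a hedged sketch assuming the block order described in the question (conv -> attention -> noise, repeated twice, followed by a residual conv); all the layer functions are stand-ins, not the repository's code:

```python
import numpy as np

# Purely illustrative stand-ins for the real layers; each returns a tensor
# of the right shape so the control flow can be run as-is.
def conv(x):       return x                               # conv0_up / conv1 stand-in
def attention(x):  return x                               # bipartite attention stand-in
def add_noise(x):  return x + 0.01 * np.random.randn(*x.shape)

def synthesis_block(x):
    # The "*2" applies to this whole (conv -> attention -> noise) unit,
    # not to each operation separately: once with the upsampling conv
    # (conv0_up) and once with the plain conv (conv1).
    for _ in range(2):
        x = conv(x)
        x = attention(x)
        x = add_noise(x)
    return x + conv(x)  # residual connection via the resnet conv

x = np.random.randn(4, 4, 32)
print(synthesis_block(x).shape)
```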
Hi,
Thank you for your great work.
I just want to mention that the order of layers in the code seems different from the figure you shared. Shouldn't there be a modulated convolution layer before and after the attention layer? Based on the code, a synthesizer block consists of a conv0_up layer + attention + noise addition + conv1 + attention + noise addition + a ResNet conv.
So why does the figure show an (up_sampling x 2) and 2 x resnet convolutions after the attention layer?