Open yangdongchao opened 2 years ago
I'm not quite sure, to be honest, as the authors don't seem to give any information on that, so I would expect it's not that different from the original VQGAN paper. Take a look here: https://github.com/CompVis/taming-transformers/blob/master/taming/models/cond_transformer.py#L80
There the conditional information is simply prepended, and I would guess the same happens here. The Bidirectional Transformer has a context window of 512 and an image has 256 tokens, so it would be straightforward to insert the conditional information at the beginning. For example, take a segmentation map: you encode it to the latent space, use the first 256 tokens for the encoded segmentation map, and the remaining 256 are the masked image tokens. That way you can train with the condition. The same applies to a simple class condition.
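A minimal sketch of what that prepending could look like, assuming hypothetical names (`MASK_ID`, `build_transformer_input`, the 16x16 latent grid) rather than this repo's actual API:

```python
import torch

MASK_ID = 1024          # hypothetical codebook index reserved for the [MASK] token
NUM_IMAGE_TOKENS = 256  # 16x16 latent grid -> 256 image tokens
NUM_COND_TOKENS = 256   # e.g. an encoded segmentation map of the same size

def build_transformer_input(cond_tokens: torch.Tensor,
                            image_tokens: torch.Tensor,
                            mask_ratio: float = 1.0) -> torch.Tensor:
    """Concatenate condition tokens with (partially) masked image tokens.

    cond_tokens:  (B, 256) indices from encoding the condition (e.g. a segmentation map)
    image_tokens: (B, 256) indices from encoding the target image
    Returns a (B, 512) sequence [condition | masked image], filling the 512 context window.
    """
    # Randomly choose which image tokens to replace with the mask token.
    mask = torch.rand(image_tokens.shape, device=image_tokens.device) < mask_ratio
    masked_image = image_tokens.masked_fill(mask, MASK_ID)
    return torch.cat([cond_tokens, masked_image], dim=1)

# Toy usage: at inference the image part starts fully masked (mask_ratio=1.0)
# and is filled in iteratively, while the condition tokens stay fixed.
cond = torch.randint(0, 1024, (2, NUM_COND_TOKENS))
img = torch.randint(0, 1024, (2, NUM_IMAGE_TOKENS))
seq = build_transformer_input(cond, img, mask_ratio=1.0)
print(seq.shape)  # torch.Size([2, 512])
```

For a plain class condition, the prepended part can be as short as a single class-id token instead of 256 segmentation tokens; the idea is the same.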
Thank you very much. I understand it.
Hi, thanks for open-sourcing this. It is great work. I want to ask a question about this paper. The Bidirectional Transformer is trained without any conditional input; it just tries to predict the masked tokens. But when we run inference with it, for example using the model for a class-conditional image synthesis task, how can the class-condition information be used?