LTH14 / mar

PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
MIT License

Difference Between MAR and MAGE #30

Open JeremyCJM opened 2 months ago

JeremyCJM commented 2 months ago

Hi Tianhong, thank you for your inspiring work! While reading the paper, I had some questions regarding the term “MAR.” Aside from the difference mentioned in the paper—where the next set of tokens in MAR is sampled randomly, unlike MAGE—are there any other differences between MAR and MAGE in terms of masked modeling? Could MAGE and MaskGIT also be considered “Masked Auto-Regressive” models? If that’s the case, it seems like the auto-regressive nature in MAR is already encompassed within the masked modeling process itself. I would appreciate any clarification on this point.

LTH14 commented 2 months ago

Thanks for your interest!

Are there any other differences between MAR and MAGE in terms of masked modeling?

The masking design for MAR and MAGE is almost the same: a variable masking ratio is sampled from a truncated Gaussian distribution. One small difference is that MAR uses a fully sparse encoder similar to MAE, while MAGE still has mask tokens (which is actually due to a legacy issue in MAGE's JAX implementation).
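For illustration, here is a minimal sketch of that masking scheme: a per-sample masking ratio drawn from a truncated Gaussian, then a random binary mask over the token sequence. The constants (minimum ratio 0.7, mean 1.0, std 0.25) and the helper name `sample_mask` are assumptions for this sketch, not necessarily the exact values used in the repo.

```python
import numpy as np
import torch
from scipy import stats

def sample_mask(batch_size, seq_len, mask_ratio_min=0.7, mu=1.0, sigma=0.25):
    # Sample a masking ratio per example from a Gaussian truncated to
    # [mask_ratio_min, 1.0] (illustrative constants, not the repo's exact ones).
    a = (mask_ratio_min - mu) / sigma
    b = (1.0 - mu) / sigma
    ratios = stats.truncnorm(a, b, loc=mu, scale=sigma).rvs(batch_size)

    # Build a random binary mask: 1 = masked (to be predicted), 0 = visible.
    masks = torch.zeros(batch_size, seq_len)
    for i, r in enumerate(ratios):
        num_masked = int(np.ceil(seq_len * r))
        perm = torch.randperm(seq_len)
        masks[i, perm[:num_masked]] = 1.0
    return masks

# Example: mask a 16x16 grid of latent tokens for a batch of 4 images.
masks = sample_mask(batch_size=4, seq_len=256)
```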

Could MAGE and MaskGIT also be considered “Masked Auto-Regressive” models?

Although MaskGIT and MAGE describe themselves as "non-autoregressive" models in the literature, we believe that MAR, MaskGIT, and MAGE are all conceptually "autoregressive" models, in which new tokens are generated based on existing tokens.

Auto-regressive vs. masked modeling

First of all, I'd like to clarify that masked modeling alone does not encompass "autoregressive modeling" -- for example, as a typical masked modeling method, MAE cannot perform autoregressive modeling of images. "Masked modeling" is a way to train a model, while "autoregressive modeling" is a way to model a distribution and sample from it.

Traditionally, GPT-like models (e.g., VQGAN, VAR) are considered "auto-regressive," while BERT-like models (e.g., MAR, MaskGIT, MAGE) are considered "masked modeling." However, in this paper, we want to clarify that both GPT-like and BERT-like models are, in fact, auto-regressive. They differ in their training methods: BERT-like models are trained with "masked modeling" (MLM loss), whereas GPT-like models are trained with "teacher forcing" (LM loss). The attention mechanisms (bidirectional vs. causal) and generation orders (random vs. raster) also differ, as discussed in the paper. However, none of these differences affects the auto-regressive nature of either model -- generating new tokens from existing tokens.
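To make the contrast between the two training methods concrete, here is a toy sketch of the two objectives. It is illustrative only: it uses a categorical cross-entropy loss in both cases (MAR itself replaces per-token categorical prediction with a diffusion loss), and `model` is a placeholder for any network returning per-position logits, not the actual interface in this repo.

```python
import torch
import torch.nn.functional as F

def masked_modeling_loss(model, tokens, mask):
    # BERT-like (MAR / MaskGIT / MAGE): bidirectional attention, loss
    # computed only on the randomly masked positions.
    logits = model(tokens, mask)                     # [B, L, vocab]
    return F.cross_entropy(logits[mask.bool()], tokens[mask.bool()])

def teacher_forcing_loss(model, tokens):
    # GPT-like (e.g., VQGAN's transformer): causal attention, loss on
    # predicting every next token in raster order.
    logits = model(tokens[:, :-1], mask=None)        # [B, L-1, vocab]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1))
```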

JeremyCJM commented 2 months ago

Awesome! Thanks a lot for your prompt and detailed reply!

zhaoyanpeng commented 2 months ago

Hi Tianhong, awesome work, thank you!

Could you say a bit more about how the "fully randomized order" is implemented in MAR at training time and at test time, respectively, and how it addresses the gap between training-time and inference-time behavior in MAGE?

Thanks, - Y

LTH14 commented 2 months ago

@zhaoyanpeng Thanks for your interest! In MAR, MAGE, and MaskGIT, during training we randomly mask out some tokens and ask the model to reconstruct them. During inference, the generation order of MAR is also fully random. In MAGE and MaskGIT, however, the generation order is determined by the confidence of each iteration's predicted tokens: they generate the k most confident tokens, instead of randomly selecting k tokens from the remaining ones.
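For concreteness, here is a toy sketch of the two selection rules at inference time. The names `remaining_idx`, `token_probs`, and `k` are placeholders for this illustration, not the actual code in this repo.

```python
import torch

def pick_next_tokens_mar(remaining_idx, k):
    # MAR: choose k of the still-masked positions uniformly at random.
    perm = torch.randperm(remaining_idx.numel())
    return remaining_idx[perm[:k]]

def pick_next_tokens_maskgit(remaining_idx, token_probs, k):
    # MaskGIT / MAGE: keep the k positions whose sampled tokens the model
    # is most confident about (token_probs holds those probabilities).
    conf, order = token_probs.sort(descending=True)
    return remaining_idx[order[:k]]
```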

zhaoyanpeng commented 2 months ago

Got it! Thank you for your prompt and detailed explanation.