FoundationVision / VAR

[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
MIT License

Questions About Parallel Decoding #24

Closed helloheshee closed 2 months ago

helloheshee commented 2 months ago

Thank you for your outstanding work. I still have a question about parallel decoding and the use of cross-entropy loss: it assumes that the tokens at a given resolution are independent of each other, so can this approach really model the tokens within one resolution well, and why?

Consider a scenario with only two samples: X0: (1, 2) and X1: (2, 1). If we decode in parallel with a query vector of length 2 and train with independent cross-entropy, the model learns a uniform distribution at each position, mapping the query vector to [0.5, 0.5]. Sampling then produces (1, 1), (1, 2), (2, 1), and (2, 2) with equal probability, which differs from the true distribution. I am not sure whether I have misunderstood something, and I would appreciate your guidance on this.
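A minimal PyTorch sketch of this two-sample scenario (a toy illustration, not code from the VAR repository): training shared, unconditioned per-position logits with independent cross-entropy drives each position's distribution to [0.5, 0.5], so independent sampling assigns probability 0.25 to each of the four pairs.

```python
import torch
import torch.nn.functional as F

# Vocabulary indices: 0 -> token "1", 1 -> token "2".
data = torch.tensor([[0, 1],   # X0: (1, 2)
                     [1, 0]])  # X1: (2, 1)

# One logit vector per position, shared across samples (no conditioning).
logits = torch.randn(2, 2, requires_grad=True)  # (position, vocab)
opt = torch.optim.SGD([logits], lr=0.5)

for _ in range(200):
    opt.zero_grad()
    # Independent cross-entropy at each position, averaged over both samples.
    loss = F.cross_entropy(
        logits.unsqueeze(0).expand(2, -1, -1).reshape(-1, 2),
        data.reshape(-1),
    )
    loss.backward()
    opt.step()

# Each position converges to ~[0.5, 0.5], so independent sampling yields
# (1,1), (1,2), (2,1), (2,2) each with probability ~0.25.
print(logits.softmax(-1))
```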

keyu-tian commented 2 months ago

In the case of 'X0: (token0=1, token1=2), X1: (token0=2, token1=1)', token0 and token1 are not independent of each other. But in VAR, we can assume the tokens within a scale r_i are independent of each other.

If you think about the fact that all token maps (r_1, r_2, ..., r_K) are summed up to produce the complete \hat{f} (in other words, we decompose an image feature map f into K components r_1 to r_K), this "independence" of the tokens within a particular r_i becomes acceptable. When you sample r_i, all preceding components (r_1, r_2, ..., r_{i-1}) are visible, so you can reasonably assume independence among the tokens within r_i, that is, p(r_i^{1,1}, r_i^{1,2}, ..., r_i^{h,w} | r_{<i}) = p(r_i^{1,1} | r_{<i}) * p(r_i^{1,2} | r_{<i}) * ... * p(r_i^{h,w} | r_{<i}). Under this assumption, parallel decoding is fine.
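Under this factorization, sampling one scale amounts to drawing every token from its own categorical distribution in a single step. A minimal sketch under assumed shapes (logits of shape (h*w, V) already conditioned on r_{<i}; not the repository's actual sampling code):

```python
import torch

def sample_scale(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Sample all h*w tokens of one scale r_i in parallel.

    logits: (h*w, V) token logits, each row already conditioned on the
            preceding scales r_{<i}.
    Returns: (h*w,) token indices, drawn independently per position.
    """
    probs = torch.softmax(logits / temperature, dim=-1)  # per-token categorical
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

# Usage with hypothetical sizes: a 4x4 scale and a vocabulary of 4096 codes.
h, w, V = 4, 4, 4096
tokens_i = sample_scale(torch.randn(h * w, V))
print(tokens_i.shape)  # torch.Size([16])
```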

Why this assumption is reasonable: to decode r_i, depending on all of r_{<i} is enough, thanks to the locality of the multi-scale token-map pyramid. Consider generating the top of a pyramid (a single token in r_i). You don't need to look at the tops of the other pyramids (the other tokens in r_i), because you can already see all the bottom tokens from r_{<i}, and they provide enough information.

So in short, in the context of VAR, tokens at a given scale r_i can be assumed independent (while still conditioning on all tokens in the preceding token maps r_{<i}), which addresses the concern you mentioned.

I think this assumption would break if VAR had only one scale, but it holds when VAR has many scales.
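On the training side, the same assumption means the loss is just independent per-token cross-entropy within each scale, with conditioning on r_{<i} provided through teacher forcing. A rough sketch under assumed shapes and hypothetical names (not the repository's actual training loop):

```python
import torch
import torch.nn.functional as F

def var_ce_loss(logits_per_scale, targets_per_scale):
    """Cross-entropy under the per-scale independence assumption.

    logits_per_scale:  list of K tensors, each (B, h_i*w_i, V), produced with
                       all ground-truth tokens of r_{<i} visible (teacher forcing).
    targets_per_scale: list of K tensors, each (B, h_i*w_i) of token indices.
    """
    loss = 0.0
    for logits, targets in zip(logits_per_scale, targets_per_scale):
        B, N, V = logits.shape
        # Independent CE per token within the scale; the conditioning on
        # r_{<i} is already baked into the logits.
        loss = loss + F.cross_entropy(logits.reshape(B * N, V), targets.reshape(B * N))
    return loss / len(logits_per_scale)
```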

helloheshee commented 2 months ago

Thank you for the prompt and detailed response. I have one more question on this point: during parallel decoding the tokens are not actually independent, because self-attention is used and the tokens interact with each other in that process; yet when computing the cross-entropy loss and when sampling, the implicit assumption is that the tokens within the same level are independent. This inconsistency is exactly what surprises me about its effectiveness. Also, while it is true that at the higher levels r_i the model can keep the tokens independent by conditioning on the previous levels, at the lower levels there is much less to condition on, and it is hard to argue that the tokens there can still be sampled independently. I tried to inspect the generated demo samples for anomalies that this sampling scheme might cause, but found that it works quite well. I therefore suspect that the VAR architecture works well when it is class-conditional and the inter-class differences are small, but I doubt whether it would remain effective for unconditional generation or on datasets with larger inter-class differences.

keyu-tian commented 2 months ago

Thank you @helloheshee for such insightful thoughts! I agree that self-attention helps enhance the interactions among tokens, especially at the early scales. As for inter-class differences, I would suggest trying a lower cfg (e.g., 1.5). I feel VAR still demonstrates good diversity even within one class; we just need a lower cfg.
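For reference, classifier-free guidance in its common logit-space form, where a smaller cfg scale weakens the pull toward the class condition (a generic sketch; the exact way the repository applies cfg may differ):

```python
import torch

def apply_cfg(cond_logits: torch.Tensor,
              uncond_logits: torch.Tensor,
              cfg: float = 1.5) -> torch.Tensor:
    """Classifier-free guidance on logits.

    cond_logits / uncond_logits: (..., V) logits from the class-conditional and
    the unconditional (null-class) forward passes for the same positions.
    cfg = 1.0 reproduces the conditional logits; larger values sharpen the
    class condition, smaller values keep more within-class diversity.
    """
    return uncond_logits + cfg * (cond_logits - uncond_logits)
```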