zheedong opened 3 months ago
blocks is used to reconstruct the causal embedding and serves only as a training objective, while blocks_for_image is used to reconstruct the CLIP image features, which can then be decoded into realistic images with the SD U-Net.
In get_codebook_indicies (in 'qformer_quantizer.py'), only blocks_for_image is applied to obtain the reconstructed image features, so that the discrete tokens can be decoded into images.
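To make the split concrete, here is a minimal, hypothetical sketch of the idea described above: two separate decoder heads over the quantized codes, where `blocks` reconstructs the causal embedding as a training-only objective and `blocks_for_image` reconstructs the CLIP features used for decoding. All class names, layer choices, and dimensions here are invented for illustration and are not the repo's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyQuantizerHeads(nn.Module):
    """Illustrative only: mirrors the blocks / blocks_for_image split."""

    def __init__(self, dim: int = 32, clip_dim: int = 64):
        super().__init__()
        # 'blocks': reconstructs the causal embedding (used only as a
        # training objective, never at inference)
        self.blocks = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        # 'blocks_for_image': reconstructs CLIP image features, which a
        # diffusion decoder (e.g. SD U-Net) could turn into pixels
        self.blocks_for_image = nn.Linear(dim, clip_dim)

    def training_loss(self, quant, causal_target, clip_target):
        # During tokenizer training, BOTH heads contribute losses.
        rec_causal = self.blocks(quant)
        rec_clip = self.blocks_for_image(quant)
        return F.mse_loss(rec_causal, causal_target) + F.mse_loss(rec_clip, clip_target)

    @torch.no_grad()
    def decode_to_clip(self, quant):
        # On the inference path (the get_codebook_indicies use case),
        # only blocks_for_image is needed: codes -> CLIP features.
        return self.blocks_for_image(quant)
```

Under this reading, the two heads are not redundant: they regress different targets (causal embeddings vs CLIP features), so unifying them would conflate two different reconstruction objectives.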
Why is 'blocks' needed? Why don't you unify 'blocks' and 'blocks_for_image'? Or why not reconstruct the causal embedding through 'blocks' first, and then apply 'blocks_for_image'?
Hi, in tokenizer training you apply blocks to reconstruct the causal embedding, and you also apply blocks_for_image (in 'blip2_qformer_codebook_all_image.py'). But you apply only blocks in get_codebook_indicies (in 'qformer_quantizer.py'). Why is it different here?