FoundationVision / OmniTokenizer

[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.
https://www.wangjunke.info/OmniTokenizer/
MIT License
264 stars, 7 forks

Wrong reshape order in PEG #11

Closed: dreamofuture closed this issue 4 months ago

dreamofuture commented 4 months ago

Hello, while reviewing your code I found one thing that confuses me:

    x = x.reshape(shape, -1)  # in PEG.forward at OmniTokenizer/modules/attention.py

The original shape of x is [b h * w, t, d], but the target shape is [b, t, h, w, -1] when PEG is called by dec_temporal_transformer or enc_temporal_transformer in OmniTokenizer/omnitokenizer.py. A plain reshape between these two layouts scrambles the time and space axes, so the conv3d that follows no longer operates on meaningful spatio-temporal neighborhoods. Surprisingly, the encode and decode results of the provided model still seem OK.
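To illustrate the concern, here is a minimal pure-Python sketch of why a plain reshape cannot turn a time-last layout into a time-first one. It assumes the leading axis packs batch and spatial positions, i.e. a row-major [(b h w), t] layout, and it drops the feature axis d because that axis stays contiguous either way. The sizes, the `at` helper, and the variable names are hypothetical and not taken from the repository, and the permutation shown is not necessarily the fix the maintainers applied.

```python
from itertools import product

# Hypothetical small sizes chosen for illustration only.
b, t, h, w = 1, 2, 2, 2

# Assumed source layout [(b h w), t], stored row-major. Each element is
# labelled with the (batch, time, row, col) position it really belongs to.
flat = [(bi, ti, yi, xi)
        for bi, yi, xi in product(range(b), range(h), range(w))
        for ti in range(t)]

def at(buf, dims, idx):
    """Read row-major buffer `buf`, viewed with shape `dims`, at index `idx`."""
    offset = 0
    for size, i in zip(dims, idx):
        offset = offset * size + i
    return buf[offset]

target = (b, t, h, w)
indices = list(product(*(range(n) for n in target)))

# Plain reshape: same buffer, new shape, no data movement.
naive = {idx: at(flat, target, idx) for idx in indices}

# Permute, then reshape: move t ahead of the spatial axes before viewing.
permuted = [flat[((bi * h + yi) * w + xi) * t + ti]
            for bi, ti, yi, xi in indices]
fixed = {idx: at(permuted, target, idx) for idx in indices}

# Positions where the plain reshape holds data from a different (t, y, x).
mismatches = [idx for idx, label in naive.items() if idx != label]
print(len(mismatches), "of", len(indices), "positions are misplaced")
```

In PyTorch terms, the permuted variant would correspond to something like reshaping to (b, h, w, t, -1) and then calling permute(0, 3, 1, 2, 4), again under the layout assumption stated above.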

wdrink commented 4 months ago

Thanks! Fixed already