FoundationVision / OmniTokenizer

[NeurIPS 2024] OmniTokenizer: one model and one weight for joint image-video tokenization.
https://www.wangjunke.info/OmniTokenizer/
MIT License

Wrong reshape order in PEG still exists #12

Closed dreamofuture closed 4 months ago

dreamofuture commented 4 months ago

Thanks for your attention to this problem, but it does not seem to be fully solved by your correction. In PEG.forward, the final return x.reshape(orig_shape) is applied to an x of shape [B, THW, C], while orig_shape is [B*H*W, T, C] or [B*T, H*W, C]; the element counts match, so the reshape runs without error but scrambles the token order. I fixed this locally and the provided models work almost fine, but the reconstructed video is temporally incoherent. Do these models need to be retrained?
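A minimal sketch of the kind of fix involved (the module layout, argument names, and the (B, T, H, W) bookkeeping are assumptions for illustration, not the repository's exact code): invert the exact permutation instead of reshaping [B, THW, C] straight into orig_shape.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Sketch of a PEG layer (positional encoding via a depthwise 3D conv)
    that handles both token layouts explicitly instead of relying on
    x.reshape(orig_shape)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv3d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x, shape):
        # x is either [B*T, H*W, C] (spatial blocks) or [B*H*W, T, C]
        # (temporal blocks); `shape` carries (B, T, H, W).
        B, T, H, W = shape
        C = x.shape[-1]
        is_spatial = x.shape[0] == B * T

        # Fold back to a dense [B, C, T, H, W] video tensor.
        if is_spatial:
            x = x.reshape(B, T, H, W, C)
        else:
            x = x.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)
        x = x.permute(0, 4, 1, 2, 3)  # -> [B, C, T, H, W]

        x = self.proj(x) + x  # conv positional encoding, residual

        # Invert the exact permutation: reshaping [B, C, T, H, W]
        # (i.e. [B, THW, C] after flattening) directly into orig_shape
        # has the right element count, so it runs silently, but it
        # scrambles the token order in the temporal layout.
        x = x.permute(0, 2, 3, 4, 1)  # -> [B, T, H, W, C]
        if is_spatial:
            return x.reshape(B * T, H * W, C)
        return x.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
```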

And two other problems (see the sketch after the list):

  1. q_stride in do_pool and Attention seems intended for spatial downsampling, but I can't understand how the code works: pooling a [B, HW, ...] tensor via .view(B, q_stride, -1, ...).max(dim=1).values takes the max over tokens that are HW/q_stride positions apart rather than over adjacent tokens. Since q_stride is always 1 in your code this never causes an error, but it is confusing.
  2. When scaled_dot_product_attention is used for temporal attention, no temporal positional encoding seems to be added to q/k.
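A small sketch illustrating both points (the tensor shapes and the temporal_attention/temporal_pos names are assumptions for illustration, not the repository's code):

```python
import torch
import torch.nn.functional as F

B, T, HW, C, q_stride = 2, 8, 16, 4, 2
x = torch.randn(B, HW, C)

# Point 1: view(B, q_stride, -1, C) splits the HW tokens into q_stride
# contiguous chunks and maxes ACROSS chunks, i.e. it pools token i with
# token i + HW // q_stride -- distant positions, not spatial neighbours.
pooled_across_chunks = x.view(B, q_stride, -1, C).max(dim=1).values

# A conventional strided pool would max over ADJACENT tokens instead:
pooled_adjacent = x.view(B, -1, q_stride, C).max(dim=2).values  # [B, HW//2, C]

# Point 2: scaled_dot_product_attention is position-agnostic, so a
# temporal positional embedding must be added to q/k explicitly.
def temporal_attention(q, k, v, temporal_pos):
    # q, k, v: [B*H*W, heads, T, head_dim]; temporal_pos: [T, head_dim]
    q = q + temporal_pos  # broadcasts over the batch and head dims
    k = k + temporal_pos
    return F.scaled_dot_product_attention(q, k, v)

heads, head_dim = 4, C
q = k = v = torch.randn(B * HW, heads, T, head_dim)
pos = torch.randn(T, head_dim)
out = temporal_attention(q, k, v, pos)  # [B*H*W, heads, T, head_dim]
```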
wdrink commented 4 months ago

Really appreciate your findings. We did mis-process the tensor shape in PEG; the code has been updated, and the checkpoints will be updated afterwards. As you mention, q_stride is not used in our code, so I have removed it to avoid misunderstanding.

wdrink commented 4 months ago

Sorry for the confusion; we will temporarily roll back the training code to obtain better reconstruction results.