Hello, while reviewing your code I found one thing that confused me:

x = x.reshape(shape, -1)  # in PEG.forward at OmniTokenizer/modules/attention.py

The original shape of x here is [b * h * w, t, d] (batch, height, and width folded into one axis), but the target shape is [b, t, h, w, -1] when PEG is called from dec_temporal_transformer or enc_temporal_transformer in OmniTokenizer/omnitokenizer.py. A plain reshape between these two layouts scrambles the time-space ordering, so the Conv3d applied afterwards should be meaningless for the model. Surprisingly, though, the encode/decode results of the provided model still look fine.
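If I understand the shapes correctly, the mismatch can be demonstrated with a small sketch. This uses NumPy for a self-contained repro (torch.reshape has the same row-major semantics), and the concrete sizes b, h, w, t, d are made up for illustration:

```python
import numpy as np

b, h, w, t, d = 2, 3, 4, 5, 6
x = np.arange(b * h * w * t * d).reshape(b * h * w, t, d)

# What the reshape in PEG.forward appears to do:
# reinterpret the flat buffer directly as [b, t, h, w, d].
wrong = x.reshape(b, t, h, w, d)

# What preserving the time-space layout requires: unfold the folded
# batch axis first, then permute t in front of h and w
# (einops: '(b h w) t d -> b t h w d').
right = x.reshape(b, h, w, t, d).transpose(0, 3, 1, 2, 4)

print(np.array_equal(wrong, right))  # → False: elements land in different places
```

So a bare reshape only produces the correct layout by accident at a few positions (e.g. the very first [0, 0, 0, 0] block), while most elements end up with mixed-up time and space indices.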