**Ted-developer** opened 5 months ago
I have the same question. In the paper, the outputs from Ref_Net and the UNet are concatenated along the width dimension, but the code does this:

```python
bank_fea = [
    rearrange(
        d.unsqueeze(1).repeat(1, video_length, 1, 1),
        "b t l c -> (b t) l c",
    )
    for d in self.bank
]
modify_norm_hidden_states = torch.cat(
    [norm_hidden_states] + bank_fea, dim=1
)
```
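A minimal shape sketch of what that snippet does, using toy dimensions I made up for illustration (not the repo's real sizes), and plain `flatten` in place of `einops.rearrange`. It shows that `dim=1` is the token (sequence-length) axis, so the reference features are appended as extra tokens rather than concatenated along the spatial width:

```python
import torch

# Toy dimensions, assumed purely for illustration (not the repo's real sizes)
b, video_length, l, c = 2, 4, 16, 8

# UNet hidden states for every frame, already flattened to ((b t), l, c)
norm_hidden_states = torch.randn(b * video_length, l, c)
# Two reference feature maps of shape (b, l, c), standing in for self.bank
bank = [torch.randn(b, l, c) for _ in range(2)]

# Repeat each reference feature across the t frames, then flatten (b, t) -> (b t);
# equivalent to the repo's einops.rearrange("b t l c -> (b t) l c")
bank_fea = [
    d.unsqueeze(1).repeat(1, video_length, 1, 1).flatten(0, 1)
    for d in bank
]

# dim=1 is the token axis: reference tokens are appended after the UNet
# tokens (key/value concatenation for attention), not a width-wise concat
modify_norm_hidden_states = torch.cat([norm_hidden_states] + bank_fea, dim=1)
print(tuple(modify_norm_hidden_states.shape))  # (8, 48, 8)
```

The token count grows from 16 to 48 while the batch and channel axes are unchanged, which is what you would expect from attention-style token concatenation rather than a spatial-width concat.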
Great work, but I have some questions. From the code, it seems that the way the RefNet is connected to the UNet is similar to magic, but the anyone paper doesn't seem to describe it this way. Also, the handling of spatial attention is the same as in magic. Or did I misunderstand something? Please help clarify.