KAIST-Visual-AI-Group / SALAD

Official implementation of SALAD (ICCV 2023).

Why only self-attention #3

Closed yshen47 closed 1 year ago

yshen47 commented 1 year ago

I find SALAD particularly interesting and relevant to my current research, thanks for open-sourcing this work! One question: I noticed that the phase 2 model uses a transformer encoder, not a decoder, to encode x and cond. Is there a reason the transformer decoder (i.e., cross-attention) is not used? And did you observe any performance difference between the two?

63days commented 1 year ago

Hi, Yuan Shen. Thank you for your interest in our work.

We observed a drop in model performance when we mixed extrinsic information across different parts via cross-attention.

Our hypothesis is that, due to the part-level disentanglement, each part carries its own largely independent extrinsic and intrinsic information. To illustrate: attending to the extrinsic of the chair's "leg" while learning the intrinsic of the chair's "back" is akin to attending to unrelated details.

Thus, we designed the phase 2 model so that each part's extrinsic information is delivered as independently as possible to its corresponding intrinsic. To still allow global information mixing, we use self-attention instead, resulting in an encoder architecture.
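To make the design choice concrete, here is a minimal NumPy sketch (shapes, names, and the identity Q/K/V projections are illustrative assumptions, not the repository's actual code): each part's extrinsic and intrinsic features are kept together in a single token, and global mixing happens only through plain self-attention over part tokens, rather than cross-attention between an extrinsic sequence and an intrinsic sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Toy single-head attention with identity Q/K/V projections for brevity.
    d_k = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d_k)
    return softmax(scores) @ tokens

# Hypothetical sizes: 16 parts, 8-dim extrinsic, 24-dim intrinsic features.
n_parts, d_ext, d_int = 16, 8, 24
rng = np.random.default_rng(0)
extrinsic = rng.normal(size=(n_parts, d_ext))  # per-part extrinsic features
intrinsic = rng.normal(size=(n_parts, d_int))  # per-part intrinsic features

# Encoder-style: each part's extrinsic stays paired with its own intrinsic
# inside one token, so no part attends to another part's extrinsic directly.
tokens = np.concatenate([extrinsic, intrinsic], axis=-1)  # (n_parts, d_ext + d_int)
out = self_attention(tokens)
print(out.shape)  # (16, 32)
```

In a decoder-style alternative, the intrinsic tokens would cross-attend to all extrinsic tokens, which is exactly the cross-part extrinsic mixing the authors found to hurt performance.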

Let us know if you have any further questions!

Best, Juil Koo

yshen47 commented 1 year ago

Thanks for your detailed clarification! It makes more sense to me now. One more question: why is positional encoding not used? (I notice there is some configuration for positional encoding, but it doesn't seem to be used.) My understanding is that the AdaLN MLP with a time-related encoding is an alternative to positional encoding. But in your opinion, is standard positional encoding not applicable to this task?

63days commented 1 year ago

Actually, the PositionalEncoding in model_components/transformer.py was for an internal experiment on part-index encoding: we wanted to see how additional part-index information would affect our model, even though it isn't used in our final method. We did observe improved SALAD performance when using the part-index information, thanks to the inherent semantic consistency in the arrangement of pre-trained SPAGHETTI's output parts across different shapes. However, we decided not to use this information, to keep our model from relying on that assumption.

We use the TimestepEmbedder in model_components/simple_module.py for timestep encoding. I wonder if that is what you meant by positional encoding.

Thanks.

yshen47 commented 1 year ago

Yeah, that's what I was referring to. Thanks so much for the clarification!