Closed yshen47 closed 1 year ago
Hi, Yuan Shen. Thank you for your interest in our work.
We observed a drop in model performance when we mixed extrinsic information across different parts via cross-attention.
Our hypothesis is that, due to the part-level disentanglement, each part carries its own largely independent extrinsic and intrinsic information. To illustrate, attending to the extrinsic of the chair's "leg" while learning the intrinsic of the chair's "back" amounts to attending to unrelated details.
Thus, we designed the phase 2 model so that each part's extrinsic information is delivered as independently as possible to its corresponding intrinsic. To compensate for the reduced global information mixing, we use self-attention instead, resulting in an encoder architecture.
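As a rough sketch of the two choices being contrasted (hypothetical tensor shapes and layer settings, not the actual SALAD code): a decoder cross-attends from every intrinsic query to all parts' extrinsics, while the encoder alternative concatenates the per-part intrinsic and extrinsic tokens into one sequence and runs plain self-attention over it.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: B shapes, P parts, d-dim features per part.
B, P, d = 4, 16, 64
intrinsic = torch.randn(B, P, d)   # per-part intrinsic tokens (x)
extrinsic = torch.randn(B, P, d)   # per-part extrinsic tokens (cond)

# Decoder-style mixing (argued against above): each intrinsic query
# attends to ALL parts' extrinsics via cross-attention.
decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True)
mixed = decoder_layer(tgt=intrinsic, memory=extrinsic)   # (B, P, d)

# Encoder-style alternative: concatenate along the sequence axis and use
# self-attention only, so each part's extrinsic stays a distinct token
# instead of being pooled into every intrinsic query.
encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
joint = torch.cat([intrinsic, extrinsic], dim=1)         # (B, 2P, d)
out = encoder_layer(joint)[:, :P]                        # keep intrinsic slots
```

Both paths produce a (B, P, d) tensor of updated intrinsic tokens; the difference is only in how extrinsic information reaches them.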
Let us know if you have any further questions!
Best, Juil Koo
Thanks for your detailed clarification! It makes more sense to me now. One more question: why is position encoding not used? (I notice you have some configuration for position encoding, but it does not seem to be used.) My understanding is that the AdaLN MLP with a time-related encoding serves as an alternative to position encoding. But is standard position encoding not applicable in this task, in your opinion?
Actually, the PositionalEncoding in model_components/transformer.py is for an internal experiment on part-index encoding. We wanted to see how additional part-index information would affect our model, even though it wasn't used in our final method. We did observe improved SALAD performance when using the part-index information, owing to the inherent semantic consistency in the arrangement of pre-trained SPAGHETTI's output parts across different shapes. However, we decided not to use this information so that our model does not rely on that assumption.
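For reference, a standard sinusoidal encoding indexed by part id rather than sequence position could look like the following (a hypothetical helper sketching the internal experiment, not the repo's exact PositionalEncoding):

```python
import math
import torch

def part_index_encoding(num_parts: int, d_model: int) -> torch.Tensor:
    """Sinusoidal encoding, one d_model-dim vector per part index."""
    pos = torch.arange(num_parts).unsqueeze(1).float()            # (P, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))             # (d/2,)
    pe = torch.zeros(num_parts, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

pe = part_index_encoding(16, 64)       # one embedding per part index
tokens = torch.randn(2, 16, 64) + pe   # broadcast-add over the batch
```

Adding such an encoding tags each token with its part slot, which only helps if part ordering is semantically consistent across shapes, exactly the assumption the answer above avoids relying on.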
We use TimestepEmbedder in model_components/simple_module.py for time-step encoding.
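A common shape for such a time-step embedder, shown here as a hedged sketch (layer sizes are hypothetical; see model_components/simple_module.py for the actual implementation): sinusoidal features of the scalar diffusion step t, passed through a small MLP, whose output can then modulate the network, e.g. via AdaLN-style scale/shift.

```python
import math
import torch
import torch.nn as nn

class TimestepEmbedder(nn.Module):
    """Sketch: sinusoidal embedding of the diffusion step, then an MLP."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0)
                          * torch.arange(half).float() / half)    # (dim/2,)
        args = t.float().unsqueeze(1) * freqs.unsqueeze(0)        # (B, dim/2)
        emb = torch.cat([torch.sin(args), torch.cos(args)], dim=1)
        return self.mlp(emb)                                      # (B, dim)

t = torch.randint(0, 1000, (8,))       # a batch of diffusion steps
temb = TimestepEmbedder(64)(t)         # (8, 64) conditioning vector
```

Unlike a positional encoding, this vector is shared across all tokens of a shape; it encodes where we are in the diffusion process, not where a token sits in the sequence.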
I wonder if this is what you meant by position encoding.
Thanks.
Yeah, that's what I referred to, thanks so much for the clarification!
I find SALAD particularly interesting and relevant to my current research; thanks for open-sourcing this work! One question: I notice that the phase 2 model uses a transformer encoder, not a decoder, to encode x and cond. Is there a reason why a transformer decoder is not used, i.e., no cross-attention? And did you observe any performance difference?