Hi, great job! I have a question: you feed the camera and box condition embeddings into the cross-attention together with the text embedding, but you don't fine-tune the cross-attention layers. Why is that? Could inputs that differ from the original text embeddings cause any problems?