wj-on-un opened this issue 2 weeks ago
I would like to ask whether you used depth images in the pre-training process to train the mm_projector.
I think that will be a good idea. We don't do that since we do not have enough computing resources.
If not, the input shape to the mm_projector would differ between pre-training and fine-tuning, so the pre-trained projector could not be reused: concatenating the image features (batch, sequence, dimension) with the depth image features (batch, sequence, dimension) along the last dimension gives (batch, sequence, dimension * 2).
I don't think so. The mm_projector is a linear layer and only transforms the last dimension of the input features into the correct size. What matters is the visual encoder.
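To illustrate that point, here is a shapes-only sketch of why a token-wise linear projector is reusable regardless of how many tokens pass through it. The 1024 encoder dimension and 4096 LLM dimension are assumed values for illustration, not taken from the repo:

```python
# Shapes-only sketch: a linear mm_projector maps ONLY the last dimension,
# so any batch size or token count passes through unchanged.
# enc_dim=1024 and llm_dim=4096 are assumed values for illustration.

def mm_projector_shape(x_shape, llm_dim=4096):
    """Linear layer applied token-wise: only the last dim changes."""
    return x_shape[:-1] + (llm_dim,)

# The same projector handles 144 tokens (one image) or 288 tokens
# (RGB + depth), as long as the incoming feature dimension is unchanged.
assert mm_projector_shape((2, 144, 1024)) == (2, 144, 4096)
assert mm_projector_shape((2, 288, 1024)) == (2, 288, 4096)
```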
Or do you mean projecting both the image and the depth image through the trained mm_projector, concatenating the 144 image feature tokens with the 144 depth image tokens, and then feeding them into the LLM? (like showing two images)
This is correct. RGB image and depth image are passed into visual encoder separately.
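A minimal sketch of that two-stream setup (shapes only; everything except the 144 tokens per modality is an assumed value for illustration):

```python
# Shapes-only sketch of the two-stream setup described above: RGB and depth
# are encoded SEPARATELY, each is projected by the same mm_projector, and
# the token sequences are concatenated along the sequence axis
# (144 + 144 = 288 tokens), like showing the LLM two images.

def visual_encoder_shape(batch, num_tokens=144, enc_dim=1024):
    return (batch, num_tokens, enc_dim)

def mm_projector_shape(x_shape, llm_dim=4096):
    return x_shape[:-1] + (llm_dim,)

rgb = mm_projector_shape(visual_encoder_shape(batch=2))
depth = mm_projector_shape(visual_encoder_shape(batch=2))

# Concatenate along the sequence (token) axis, NOT the feature axis,
# so the projector's input/output dimensions never change.
llm_input = (rgb[0], rgb[1] + depth[1], rgb[2])
assert llm_input == (2, 288, 4096)
```

Concatenating along the sequence axis is what avoids the (batch, sequence, dimension * 2) mismatch raised in the original question.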
Hello. I have some additional questions.
1) Does the Affordance map shown as the result in the paper use the Affordance map generated by the last mask_decoder as is?
2) If you use a threshold method, do you remove the areas with low interest and then normalize them?
2-1) Is there another way to generate a specific Affordance map?
Does the Affordance map shown as the result in the paper use the Affordance map generated by the last mask_decoder as is? If you use the Threshold method, do you remove the areas with low interest and then normalize them?
We do not threshold it. Instead, we apply a sigmoid to get a probability map (0-1). This is important for affordance map evaluation, since the metric only works on a distribution. If we thresholded it, we could get a reasonable-looking affordance map in visualization, but the metric would be really bad.
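As a concrete sketch of that post-processing (pure Python; the logit values and the final normalization step are illustrative assumptions, not the repo's exact evaluation code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up mask_decoder logits for a tiny 2x2 affordance "map".
logits = [[2.0, -1.0],
          [0.5, -3.0]]

# No thresholding: a per-pixel sigmoid gives a probability map in (0, 1).
prob_map = [[sigmoid(v) for v in row] for row in logits]
assert all(0.0 < v < 1.0 for row in prob_map for v in row)

# Distribution-based metrics compare the map as a whole, so one common
# step (an assumption here) is normalizing the map to sum to 1 before
# evaluation; hard-thresholded maps destroy this distribution structure.
total = sum(v for row in prob_map for v in row)
dist = [[v / total for v in row] for row in prob_map]
assert abs(sum(v for row in dist for v in row) - 1.0) < 1e-9
```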
Is there another way to generate a specific Affordance map?
Can you explain what's a "specific" affordance map? Is it for specific objects?
Thank you. Applying a sigmoid to the learned mask decoder's predictions generates good maps without needing a separate threshold value.
The reason I mentioned a "specific" method is that I was asking whether there is another way to deal with the blocking artifacts that occur in the generated Affordance map.
I used the mask decoder structure of the SAM model: OWL-ViT -> neck -> text_hidden_fcs -> prompt_encoder -> mask_decoder. And, as in the paper, I added one transposed convolution layer as follows.
(output_upscaling): Sequential(
  (0): ConvTranspose2d(256, 128, kernel_size=(2, 2), stride=(2, 2))
  (1): LayerNorm2d()
  (2): GELU(approximate='none')
  (3): ConvTranspose2d(128, 64, kernel_size=(2, 2), stride=(2, 2))
  (4): GELU(approximate='none')
  (5): ConvTranspose2d(64, 32, kernel_size=(2, 2), stride=(2, 2))
  (6): GELU(approximate='none')
)
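For reference, the spatial effect of that stack is easy to check by hand: for ConvTranspose2d with no padding, the output size is (in - 1) * stride + kernel_size, so with kernel_size=2 and stride=2 each layer exactly doubles the grid. A quick stdlib-only sketch (the 64x64 input grid size is an assumed example):

```python
# Output-size rule for ConvTranspose2d (no padding, no output_padding):
#   out = (in - 1) * stride + kernel_size
# With kernel_size=2 and stride=2, each layer exactly doubles the grid.
def conv_transpose_out(size, kernel=2, stride=2):
    return (size - 1) * stride + kernel

size, channels = 64, 256           # assumed 64x64 input grid, 256 channels
for out_ch in (128, 64, 32):       # the three ConvTranspose2d layers above
    size = conv_transpose_out(size)
    channels = out_ch

assert (channels, size) == (32, 512)   # 8x spatial upsampling overall
```

Note that because kernel_size equals stride here, each input pixel expands into its own independent 2x2 block with no overlap between neighbors, which is one commonly cited source of block-like artifacts in transposed-convolution upsampling.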
I see. Yes, I've noticed similar artifacts during some of my experiments. Here are some of my findings:
As you can see here, I don't have a perfect solution for this, but these findings might be helpful for your exploration. If you are reporting AffordanceLLM as a baseline, you don't need to worry about the artifact and can report it as is.