JasonQSY / AffordanceLLM

Code for "AffordanceLLM: Grounding Affordance from Vision Language Models"

Concatenate Feature #2

Open wj-on-un opened 2 weeks ago

wj-on-un commented 2 weeks ago

Hello, I have a question.

First of all, I would like to ask whether you used depth images in the pre-training stage to train mm_projector.

If not, the input shape to mm_projector differs between pre-training and fine-tuning, so the pre-trained weights cannot be reused.

When concatenating the image features (batch, sequence, dimension) and the depth image features (batch, sequence, dimension) along the last axis, the resulting shape becomes (batch, sequence, dimension * 2).

Or do you mean projecting both the image and the depth image through the trained mm_projector, concatenating the 144 image feature tokens with the 144 depth image tokens, and then feeding them into the LLM (like showing two images)?

JasonQSY commented 2 weeks ago

> I would like to ask whether you used depth images in the pre-training stage to train mm_projector.

I think that would be a good idea. We did not do it since we do not have enough computing resources.

> If not, the input shape to mm_projector differs between pre-training and fine-tuning, so the pre-trained weights cannot be reused.

I don't think so. mm_projector is a linear layer and only transforms the last dimension of the input features into the correct size. What matters is the visual encoder.
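
As a minimal sketch of that shape argument (feature sizes are assumed, not taken from the repo), a linear projector maps only the last dimension, so the same weights apply to RGB and depth token sequences alike:

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096              # assumed sizes for illustration
mm_projector = nn.Linear(vision_dim, llm_dim)

rgb_feat = torch.randn(2, 144, vision_dim)    # (batch, tokens, dim) from the visual encoder
depth_feat = torch.randn(2, 144, vision_dim)  # same encoder applied to the depth image

# nn.Linear transforms only the last dimension, so both pass through without a shape mismatch.
rgb_proj = mm_projector(rgb_feat)      # (2, 144, llm_dim)
depth_proj = mm_projector(depth_feat)  # (2, 144, llm_dim)
```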

> Or do you mean projecting both the image and the depth image through the trained mm_projector, concatenating the 144 image feature tokens with the 144 depth image tokens, and then feeding them into the LLM (like showing two images)?

This is correct. The RGB image and the depth image are passed through the visual encoder separately, and the two token sequences are then concatenated before being fed to the LLM.
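
A small sketch of the token-level concatenation confirmed above, using assumed dimensions rather than the repo's actual tensors:

```python
import torch

batch, num_tokens, llm_dim = 2, 144, 4096               # assumed sizes
rgb_tokens = torch.randn(batch, num_tokens, llm_dim)    # projected RGB features
depth_tokens = torch.randn(batch, num_tokens, llm_dim)  # projected depth features

# Concatenate along the token (sequence) axis, not the feature axis,
# so the LLM sees 288 visual tokens, as if shown two images.
visual_tokens = torch.cat([rgb_tokens, depth_tokens], dim=1)
print(visual_tokens.shape)  # torch.Size([2, 288, 4096])
```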

wj-on-un commented 1 week ago

Hello. I have some additional questions.

  1. Are the affordance maps shown as results in the paper the raw outputs of the last mask_decoder, used as is?

  2. If you use a threshold, do you remove the low-interest areas and then normalize the rest?

     2-1) Is there another way to generate a specific affordance map?

JasonQSY commented 6 days ago

> Are the affordance maps shown as results in the paper the raw outputs of the last mask_decoder, used as is? If you use a threshold, do you remove the low-interest areas and then normalize the rest?

We do not threshold it. Instead, we apply a sigmoid and obtain a probability map in the (0, 1) range. This is important for affordance map evaluation, since the metrics only work on a distribution. If we threshold it, we can get a reasonable-looking affordance map for visualization, but the metrics will be much worse.
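
A minimal sketch of this post-processing, with assumed tensor shapes rather than the repo's exact evaluation code:

```python
import torch

# Hypothetical raw logits from the mask decoder: (batch, 1, H, W).
logits = torch.randn(1, 1, 224, 224)

# Sigmoid gives a per-pixel probability map in (0, 1); no thresholding is applied.
prob_map = torch.sigmoid(logits)

# For distribution-based affordance metrics, the map may additionally be normalized
# to sum to 1 (an assumption here); thresholding would destroy this distribution.
dist_map = prob_map / (prob_map.sum(dim=(-2, -1), keepdim=True) + 1e-12)
```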

> Is there another way to generate a specific affordance map?

Can you explain what's a "specific" affordance map? Is it for specific objects?

wj-on-un commented 6 days ago

Thank you. I see that applying a sigmoid to the learned mask decoder predictions generates good maps without setting a separate threshold value.

By a "specific" method I meant whether there is another way to deal with the blocking artifacts that occur in the generated affordance map.

[Image: generated affordance map showing blocking artifacts]

I used the mask decoder structure of the SAM model: OWL-ViT -> neck -> text_hidden_fcs -> prompt_encoder -> mask_decoder. And, as in the paper, I added one transposed convolution layer, as follows:

```
(output_upscaling): Sequential(
  (0): ConvTranspose2d(256, 128, kernel_size=(2, 2), stride=(2, 2))
  (1): LayerNorm2d()
  (2): GELU(approximate='none')
  (3): ConvTranspose2d(128, 64, kernel_size=(2, 2), stride=(2, 2))
  (4): GELU(approximate='none')
  (5): ConvTranspose2d(64, 32, kernel_size=(2, 2), stride=(2, 2))
  (6): GELU(approximate='none')
)
```
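
For reference, a hedged sketch of how the printed module above could be built; LayerNorm2d is assumed to be the channel-wise 2D LayerNorm from the segment-anything codebase, and each stride-2 ConvTranspose2d doubles the spatial resolution:

```python
import torch
import torch.nn as nn
from segment_anything.modeling.common import LayerNorm2d  # SAM's 2D LayerNorm (assumed import path)

output_upscaling = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2),
    LayerNorm2d(128),
    nn.GELU(),
    nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),
    nn.GELU(),
    nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),
    nn.GELU(),
)

# Three stride-2 upscaling stages: a 64x64 mask embedding becomes 512x512.
x = torch.randn(1, 256, 64, 64)
print(output_upscaling(x).shape)  # torch.Size([1, 32, 512, 512])
```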

JasonQSY commented 6 days ago

I see. Yes, I've noticed similar artifacts in some of my experiments. Here are some of my findings:

As you can see, I don't have a perfect solution for this, but these findings might be helpful for your exploration. If you are reporting AffordanceLLM as a baseline, you don't need to worry about the artifacts and can report the results as is.