wj-on-un opened this issue 2 weeks ago
I would like to ask whether you used depth images in the pre-training process to train the mm_projector.
I think that will be a good idea. We don't do that since we do not have enough computing resources.
If not, the input shape to the mm_projector would differ between pre-training and fine-tuning, so the pre-trained projector could not be reused: concatenating the image features (batch, sequence, dimension) with the depth image features (batch, sequence, dimension) along the last dimension gives (batch, sequence, dimension * 2).
I don't think so. The mm_projector is a linear layer and only transforms the last dimension of the input features into the correct size. What matters is the visual encoder.
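To illustrate that point, here is a shapes-only sketch of why a token-wise linear projector is reusable regardless of how many tokens pass through it. The 1024 encoder dimension and 4096 LLM dimension are assumed values for illustration, not taken from the repo:

```python
# Shapes-only sketch: a linear mm_projector maps ONLY the last dimension,
# so any batch size or token count passes through unchanged.
# enc_dim=1024 and llm_dim=4096 are assumed values for illustration.

def mm_projector_shape(x_shape, llm_dim=4096):
    """Linear layer applied token-wise: only the last dim changes."""
    return x_shape[:-1] + (llm_dim,)

# The same projector handles 144 tokens (one image) or 288 tokens
# (RGB + depth), as long as the incoming feature dimension is unchanged.
assert mm_projector_shape((2, 144, 1024)) == (2, 144, 4096)
assert mm_projector_shape((2, 288, 1024)) == (2, 288, 4096)
```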
Or do you mean projecting both the image and the depth image through the trained mm_projector, concatenating the 144 image feature tokens with the 144 depth image tokens, and then feeding them into the LLM? (like showing two images)
This is correct. RGB image and depth image are passed into visual encoder separately.
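A minimal sketch of that two-stream setup (shapes only; everything except the 144 tokens per modality is an assumed value for illustration):

```python
# Shapes-only sketch of the two-stream setup described above: RGB and depth
# are encoded SEPARATELY, each is projected by the same mm_projector, and
# the token sequences are concatenated along the sequence axis
# (144 + 144 = 288 tokens), like showing the LLM two images.

def visual_encoder_shape(batch, num_tokens=144, enc_dim=1024):
    return (batch, num_tokens, enc_dim)

def mm_projector_shape(x_shape, llm_dim=4096):
    return x_shape[:-1] + (llm_dim,)

rgb = mm_projector_shape(visual_encoder_shape(batch=2))
depth = mm_projector_shape(visual_encoder_shape(batch=2))

# Concatenate along the sequence (token) axis, NOT the feature axis,
# so the projector's input/output dimensions never change.
llm_input = (rgb[0], rgb[1] + depth[1], rgb[2])
assert llm_input == (2, 288, 4096)
```

Concatenating along the sequence axis is what avoids the (batch, sequence, dimension * 2) mismatch raised in the original question.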
Hello. I have some additional questions.
1) Does the Affordance map shown as the result in the paper use the Affordance map generated by the last mask_decoder as is?
2) If you use a threshold method, do you remove the areas with low interest and then normalize them?
2-1) Is there another way to generate a specific Affordance map?
Does the Affordance map shown as the result in the paper use the Affordance map generated by the last mask_decoder as is? If you use the Threshold method, do you remove the areas with low interest and then normalize them?
We do not threshold it. Instead, we apply a sigmoid to get a probability map (0-1). This is important for affordance map evaluation, since the metric only works on a distribution. If we thresholded it, we could get a reasonable-looking affordance map in visualization, but the metric would be really bad.
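As a concrete sketch of that post-processing (pure Python; the logit values and the final normalization step are illustrative assumptions, not the repo's exact evaluation code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up mask_decoder logits for a tiny 2x2 affordance "map".
logits = [[2.0, -1.0],
          [0.5, -3.0]]

# No thresholding: a per-pixel sigmoid gives a probability map in (0, 1).
prob_map = [[sigmoid(v) for v in row] for row in logits]
assert all(0.0 < v < 1.0 for row in prob_map for v in row)

# Distribution-based metrics compare the map as a whole, so one common
# step (an assumption here) is normalizing the map to sum to 1 before
# evaluation; hard-thresholded maps destroy this distribution structure.
total = sum(v for row in prob_map for v in row)
dist = [[v / total for v in row] for row in prob_map]
assert abs(sum(v for row in dist for v in row) - 1.0) < 1e-9
```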
Is there another way to generate a specific Affordance map?
Can you explain what's a "specific" affordance map? Is it for specific objects?
Thank you. Applying a sigmoid to the learned mask decoder's predictions generates good maps without needing a separate threshold value.
The reason I mentioned a "specific" method is that I was asking whether there is another way to deal with the blocking artifacts that occur in the generated Affordance map.
I used the mask decoder structure of the SAM model: OWL-ViT -> neck -> text_hidden_fcs -> prompt_encoder -> mask_decoder. And, as in the paper, I added one transposed convolution layer as follows.
(output_upscaling): Sequential(
  (0): ConvTranspose2d(256, 128, kernel_size=(2, 2), stride=(2, 2))
  (1): LayerNorm2d()
  (2): GELU(approximate='none')
  (3): ConvTranspose2d(128, 64, kernel_size=(2, 2), stride=(2, 2))
  (4): GELU(approximate='none')
  (5): ConvTranspose2d(64, 32, kernel_size=(2, 2), stride=(2, 2))
  (6): GELU(approximate='none')
)
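For reference, the spatial effect of that stack is easy to check by hand: for ConvTranspose2d with no padding, the output size is (in - 1) * stride + kernel_size, so with kernel_size=2 and stride=2 each layer exactly doubles the grid. A quick stdlib-only sketch (the 64x64 input grid size is an assumed example):

```python
# Output-size rule for ConvTranspose2d (no padding, no output_padding):
#   out = (in - 1) * stride + kernel_size
# With kernel_size=2 and stride=2, each layer exactly doubles the grid.
def conv_transpose_out(size, kernel=2, stride=2):
    return (size - 1) * stride + kernel

size, channels = 64, 256           # assumed 64x64 input grid, 256 channels
for out_ch in (128, 64, 32):       # the three ConvTranspose2d layers above
    size = conv_transpose_out(size)
    channels = out_ch

assert (channels, size) == (32, 512)   # 8x spatial upsampling overall
```

Note that because kernel_size equals stride here, each input pixel expands into its own independent 2x2 block with no overlap between neighbors, which is one commonly cited source of block-like artifacts in transposed-convolution upsampling.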
I see. Yes, I've noticed similar artifacts during some of my experiments. Here are some of my findings:
As you can see here, I don't have a perfect solution for this, but these findings might be helpful for your exploration. If you are reporting AffordanceLLM as a baseline, you don't need to worry about the artifact and can report it as is.