bytedance / fc-clip

[NeurIPS 2023] This repo contains the code for our paper Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
Apache License 2.0

Question about the ensemble code #4

Closed · SuleBai closed this issue 1 year ago

SuleBai commented 1 year ago

Hi, thanks for your great work.

I am confused about the ensemble code.

# Seen categories: weighted geometric mean of the two predictions, with exponent alpha.
cls_logits_seen = (
    (in_vocab_cls_results ** (1 - alpha) * out_vocab_cls_probs ** alpha).log()
    * category_overlapping_mask
)
# Unseen categories: same ensemble, but with exponent beta.
cls_logits_unseen = (
    (in_vocab_cls_results ** (1 - beta) * out_vocab_cls_probs ** beta).log()
    * (1 - category_overlapping_mask)
)
cls_results = cls_logits_seen + cls_logits_unseen

Here alpha = 0.4, beta = 0.8, and both in_vocab_cls_results and out_vocab_cls_probs are softmax outputs, so their values range between 0 and 1.
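For concreteness, here is a self-contained version of that computation with made-up inputs (the tensor values and the three-category mask below are invented purely for illustration; only the formula matches the snippet above):

import torch

alpha, beta = 0.4, 0.8

# Made-up softmax outputs over three categories (illustration only).
in_vocab_cls_results = torch.tensor([0.6, 0.3, 0.1])
out_vocab_cls_probs = torch.tensor([0.2, 0.5, 0.3])

# 1 = category seen during training, 0 = unseen (made-up mask).
category_overlapping_mask = torch.tensor([1.0, 1.0, 0.0])

cls_logits_seen = (
    (in_vocab_cls_results ** (1 - alpha) * out_vocab_cls_probs ** alpha).log()
    * category_overlapping_mask
)
cls_logits_unseen = (
    (in_vocab_cls_results ** (1 - beta) * out_vocab_cls_probs ** beta).log()
    * (1 - category_overlapping_mask)
)
cls_results = cls_logits_seen + cls_logits_unseen
print(cls_results)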

My question is:

For the seen classes, why do you give out_vocab_cls more weight while giving in_vocab_cls less? And similarly for the unseen classes, why do you give in_vocab_cls more weight? This seems counterintuitive: for the unseen classes we should rely more on out_vocab_cls, and for the seen classes more on in_vocab_cls.

This really confuses me. I also tried swapping alpha and beta to 0.6 and 0.2 (the reverse), but the results were much worse than the original. Could you give me some insight into this?

cornettoyu commented 1 year ago

Hi,

Thanks for your questions.

"For the seen class, why you give out-vocab-cls more weight while give in-vocab-cls less weight? And the same question for the unseen class, why you give in-vocab-cls more weight? This appears counterintuitive, since for the unseen class, we should use out-vocab-cls more, and for the seen class, we should use in-vocab-cls more."

I think there may be some misunderstanding. The code you pasted puts a weight of (1 - alpha) = 0.6 on in-vocab and alpha = 0.4 on out-vocab for the seen classes, and (1 - beta) = 0.2 on in-vocab and beta = 0.8 on out-vocab for the unseen classes, which matches the intuition.

SuleBai commented 1 year ago

Hi @cornettoyu, thank you for your prompt response, but I'm still confused.

Before the ensemble code runs, both in_vocab_cls_results and out_vocab_cls_probs have already been through a softmax, so their values range from 0 to 1.

If both the base and the exponent lie between 0 and 1, then increasing the exponent actually makes the value smaller. For example, if in_vocab and out_vocab both equal 0.7, then for the seen classes the out_vocab term comes out larger than the in_vocab term:

>>> in_vocab = 0.7
>>> out_vocab = 0.7
>>> in_vocab ** 0.6
0.8073443754472972
>>> out_vocab ** 0.4
0.8670401643811234

This is quite counterintuitive: for the seen classes it effectively gives out_vocab_cls more weight and in_vocab_cls less, and the same applies to the unseen classes. Could you explain this?

Thanks again.

cornettoyu commented 1 year ago

Hi,

I'd like to illustrate with a simple two-class example:

in_vocab = [0.6, 0.4]
out_vocab = [0.4, 0.6]

in_vocab ** 0.6 = [0.7360219228178333, 0.5770799623628855]
out_vocab ** 0.4 = [0.6931448431551464, 0.8151931096059227]
(in_vocab ** 0.6) * (out_vocab ** 0.4) = [0.5101698002503163, 0.47043160900986947]
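A quick check in plain Python with the same numbers (the list comprehension is just for illustration):

alpha = 0.4
in_vocab = [0.6, 0.4]
out_vocab = [0.4, 0.6]

# Seen-class ensemble: weighted geometric mean, in ** (1 - alpha) * out ** alpha.
ensemble = [i ** (1 - alpha) * o ** alpha for i, o in zip(in_vocab, out_vocab)]
print(ensemble)                       # [0.5101698002503163, 0.47043160900986947]
print(ensemble.index(max(ensemble)))  # 0 -> same argmax as in_vocab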

As this shows, the final prediction is biased toward in-vocab: the ensemble keeps in-vocab's ranking even though the term with the larger exponent is smaller in absolute value. What matters is the relative ordering across classes, not the magnitudes, and raising a distribution to an exponent closer to 1 preserves more of its contrast between classes, while an exponent closer to 0 flattens it toward uniform. Feel free to let me know if you have other questions :)

SuleBai commented 1 year ago

Thanks for your response! I misunderstood this before. It really helped me.