Vanint / SADE-AgnosticLT

This repository is the official PyTorch implementation of Self-Supervised Aggregation of Diverse Experts for Test-Agnostic Long-Tailed Recognition (NeurIPS 2022).

A question about performance. #4

Closed XuZhengzhuo closed 2 years ago

XuZhengzhuo commented 2 years ago

Great work. Your method solves a wider range of LT problems.

But I'm confused about TADE's performance on the vanilla LT test set.

Actually, with the same backbone and training strategy, the following methods adopt almost the same loss, yet their top-1 accuracy varies; for example, on CIFAR100-LT-IR-100:

In such a situation, TADE should achieve its best performance when the uniform expert E2 (Eq. 3 in the paper) mainly works. If so, it should not outperform the above methods by a large margin, right?

However, TADE's top-1 accuracy is 49.8% (cf. Tab. 8(a) in the paper) and the learned expert weights are [0.40, 0.35, 0.24] (cf. Tab. 12), so the forward expert E1 mainly works.

So I am just wondering how to explain TADE's improvement on the vanilla (uniform) test set?
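
(For concreteness, this is how I read the reported weights: the final prediction is a convex combination of the three experts' outputs, roughly as in the sketch below. The names here are mine, not this repository's API, and the exact aggregation form is the one defined in the paper.)

```python
import torch
import torch.nn.functional as F

def aggregate(expert_logits, weights):
    """Convex combination of per-expert predictions.

    expert_logits: (num_experts, batch_size, num_classes)
    weights:       (num_experts,), non-negative and roughly summing to 1
    """
    probs = F.softmax(expert_logits, dim=-1)   # per-expert class probabilities
    w = weights.view(-1, 1, 1)                 # broadcast over batch and classes
    return (w * probs).sum(dim=0)              # final prediction per sample

# Toy shapes for illustration: 3 experts, a batch of 4 samples, 100 classes.
expert_logits = torch.randn(3, 4, 100)

# Weights reported in Tab. 12 for the uniform test set (rounded, so they do not sum exactly to 1).
learned_w = torch.tensor([0.40, 0.35, 0.24])
# Plain average ensemble: every expert contributes equally.
average_w = torch.full((3,), 1.0 / 3)

pred_learned = aggregate(expert_logits, learned_w)
pred_average = aggregate(expert_logits, average_w)
```

Under this reading, "E1 mainly works" means the forward expert receives the largest weight even though the test set is balanced.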

Vanint commented 2 years ago

Hi, thanks very much for your attention to our work.

> In such a situation, TADE should achieve its best performance when the uniform expert E2 (Eq. 3 in the paper) mainly works.

Yes. As shown in the table below (cf. Table 11 in the appendix), the uniform expert E2 performs the best.

[Table 11 (appendix): per-expert performance on the uniform test distribution of CIFAR100-LT-100]

> However, TADE's top-1 accuracy is 49.8% (cf. Tab. 8(a) in the paper) and the learned expert weights are [0.40, 0.35, 0.24] (cf. Tab. 12), so the forward expert E1 mainly works.

This is quite an interesting phenomenon. We also considered this question when we saw that the learned weights are not equal on the uniform test distribution of some datasets. It does not match our initial expectation that the three weights would be roughly equal, so that the overall performance would be the same as the average ensemble (i.e., without using our test-time self-supervised aggregation strategy). However, as shown in the table below (cf. Table 13 in the appendix), our test-time aggregation strategy improves the performance on the uniform test distribution of CIFAR100-LT-100 by 0.4%.

[Table 13 (appendix): effect of the test-time aggregation strategy on the uniform test distribution of CIFAR100-LT-100]

Therefore, we do not think this is a technical issue that would generally lead to performance degradation. Here we offer one possible explanation. As shown in the first table above, the average performance of the forward expert E1 over all classes is higher than that of the backward expert E3 on CIFAR100-LT-100. Given this difference, the optimal weighting scheme on the uniform test distribution is not necessarily the average ensemble; instead, it may correspond to a better trade-off among the three experts. From this perspective, although the uniform test distribution has no class imbalance, the results show that our test-time aggregation strategy can adaptively find a better trade-off among experts, which leads to better overall performance. This phenomenon, which reveals a potential advantage of the test-time aggregation strategy even on the uniform test distribution, also surprised us.
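
For intuition, the idea behind the test-time aggregation can be sketched as follows: the experts stay frozen, and only the three aggregation weights are learned on unlabeled test data by encouraging the weighted predictions of two augmented views of the same image to agree. The snippet below is a simplified sketch of this idea, not the code in this repository; the function names, the softmax parameterization, and the cosine-similarity objective are illustrative.

```python
import torch
import torch.nn.functional as F

def learn_expert_weights(probs_view1, probs_view2, steps=100, lr=0.01):
    """Simplified sketch: learn only the aggregation weights at test time.

    probs_view1 / probs_view2: (num_experts, num_samples, num_classes) softmax
    outputs of the frozen experts on two augmented views of the same unlabeled
    test images. The experts themselves are never updated here.
    """
    num_experts = probs_view1.shape[0]
    w_logits = torch.zeros(num_experts, requires_grad=True)  # softmax-parameterized weights
    optimizer = torch.optim.SGD([w_logits], lr=lr)

    for _ in range(steps):
        w = F.softmax(w_logits, dim=0).view(-1, 1, 1)
        p1 = (w * probs_view1).sum(dim=0)   # aggregated prediction, view 1
        p2 = (w * probs_view2).sum(dim=0)   # aggregated prediction, view 2
        # Encourage the two views to agree by maximizing their cosine similarity.
        loss = -F.cosine_similarity(p1, p2, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return F.softmax(w_logits, dim=0).detach()  # e.g. roughly [0.40, 0.35, 0.24]
```

Since only three scalars are optimized over frozen experts, this step is cheap; on a uniform test distribution the learned weights stay close to, but not exactly at, the average ensemble, as discussed above.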

In addition, since the performance improvement on the uniform test distribution brought by our test-time aggregation strategy is slight, it is fine to directly use the average ensemble (without the test-time aggregation) in practice if you know in advance that the actual test distribution is uniform. Happy to discuss further.

XuZhengzhuo commented 2 years ago

Thank you for your patient and detailed reply! This may be worth further study. It seems that just a multi-expert architecture can improve LT without special aggregation strategies. It is amazing. Anyway, many thanks again!

Vanint commented 2 years ago

> It seems that just a multi-expert architecture can improve LT.

Yes, this has been demonstrated. You may refer to a recent survey of deep long-tailed learning (https://arxiv.org/pdf/2110.04596.pdf) for more related work on ensemble-based LT methods (cf. Sec. 3.3.4 and Sec. 4 in that survey). Note that the skill-diverse multi-expert framework proposed in our paper has shown superiority over a simpler multi-expert architecture like RIDE, which was the previous state-of-the-art method.

> without special aggregation strategies.

Aggregation is still important, especially in test-agnostic LT scenarios, which are more practical. I also agree with you that this is worth studying further.