Vanint / SADE-AgnosticLT

This repository is the official PyTorch implementation of Self-Supervised Aggregation of Diverse Experts for Test-Agnostic Long-Tailed Recognition (NeurIPS 2022).
MIT License

Try TADE on custom dataset #9

Closed Lllllp93 closed 2 years ago

Lllllp93 commented 2 years ago

Hi,

Your excellent work really caught my eye, and I wanted to try TADE on my own dataset to see whether it works for industrial tasks, but the results don't look good compared with conventional methods like focal loss. The results are shown below:

TADE without test-training: train 84 acc, val 87 acc
TADE with test-training, 1 epoch: val 82.25
TADE with test-training, 5 epochs: val 38.41
TADE with test-training, 8 epochs: val 30
Focal loss: train 88 acc, val 89 acc

It seems that the result of TADE without test-training is slightly worse than focal loss, and the accuracy gets worse as the number of test-training epochs increases. The custom dataset is for an industrial defect classification task, so most pictures have a similar background, and the pictures fall into three categories. Train dataset cls_num_list = [2883, 1019, 56].
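(For reference, a quick way to quantify how long-tailed this split is, using the usual max/min imbalance-factor definition; the numbers are taken from the list above.)

```python
# Imbalance factor (largest / smallest class count) of the reported training split.
cls_num_list = [2883, 1019, 56]
imbalance_factor = max(cls_num_list) / min(cls_num_list)
print(f"imbalance factor = {imbalance_factor:.1f}")  # ~51.5
```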

Question: I am not sure whether this is because there are only three categories in my dataset, so the output_logit vector struggles to represent similarity, which hurts the performance of self-supervised aggregation. Do you have any idea about that?

Vanint commented 2 years ago

Hi, thanks for your interest. Since different datasets may require different neural architectures and training hyper-parameters, I am not sure what the problem is. My experience is that, with reasonable architectures and hyper-parameters, ensemble-learning-based methods perform much better than focal loss.

Let's first figure out the training phase. A potential reason for the worse performance of the multi-expert model is that the amount of data is quite limited, so the network may not be trained well. Have you tried a smaller backbone? In addition, have you tried different hyper-parameters?

Lllllp93 commented 2 years ago

I have tried different backbones (ResNet-32 & ResNet-50), input image sizes (224 & 640), and initial learning rates. The model with the ResNet-50 backbone reaches higher accuracy, but both show that accuracy becomes worse as the number of test-training epochs increases. I will also try changing the value of tau in DiverseExpertLoss.
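(For reference, a rough sketch of where a temperature like tau enters a prior-adjusted softmax loss; the function names and exact forms below are illustrative assumptions, and the authoritative version is the repo's DiverseExpertLoss, which builds an inverted prior for the backward expert.)

```python
import torch
import torch.nn.functional as F

# `prior` is the normalized training class frequency, e.g. derived from
# cls_num_list = [2883, 1019, 56]. Illustrative sketch only.
def balanced_softmax_loss(logits, target, prior):
    # Forward-style expert: shift logits by the log class prior.
    return F.cross_entropy(logits + torch.log(prior + 1e-9), target)

def tail_expert_loss(logits, target, prior, tau=2.0):
    # Backward-style expert: a larger tau pushes the expert harder toward
    # tail classes; lowering tau weakens that inverse bias.
    return F.cross_entropy(logits - tau * torch.log(prior + 1e-9), target)
```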

As for the question I mentioned above, have you tried TADE on a dataset with few categories? I'm also curious why you chose the logit vector to calculate similarity instead of return_feat (the embedding vector before the logit vector).

Vanint commented 2 years ago

Hi, we have verified our method on CIFAR10-LT, which has 10 classes. We have not tried tasks with fewer than 10 classes.

We chose the logit vector to calculate similarity instead of return_feat because we only adjust the aggregation weights of the different experts, while the model parameters are not updated.
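As a rough illustration (not the repo's exact code), test-time self-supervised aggregation only learns a small weight vector over the frozen experts, and the similarity objective operates on the class-dimensional predictions; the names and shapes below are assumptions:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of test-time self-supervised aggregation: only the
# per-expert weights `w` are optimized; the expert networks stay frozen.
num_experts = 3
w = torch.nn.Parameter(torch.ones(num_experts))       # learnable aggregation weights
optimizer = torch.optim.SGD([w], lr=0.01)

def aggregate(expert_logits, w):
    # expert_logits: [num_experts, batch, num_classes]
    alpha = torch.softmax(w, dim=0)                    # normalized expert weights
    return (alpha.view(-1, 1, 1) * expert_logits).sum(dim=0)

def aggregation_loss(agg_logits_view1, agg_logits_view2):
    # Encourage stable predictions across two augmented views of the same
    # unlabeled test images (cosine similarity of the softmax outputs).
    p1 = F.softmax(agg_logits_view1, dim=1)
    p2 = F.softmax(agg_logits_view2, dim=1)
    return -F.cosine_similarity(p1, p2, dim=1).mean()
```

Since only `w` is updated, the similarity is computed on num_classes-dimensional vectors; with only 3 classes that signal is quite coarse, which relates to the concern raised in the question above.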

I think there may be something wrong. Can you report the performance of each expert? Let me see what the problem is.
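One way to produce such a report, as a rough sketch (the assumed model output shape and helper name are illustrative, not the repo's API):

```python
import torch

@torch.no_grad()
def per_expert_accuracy(model, loader, device="cuda"):
    """Report the overall accuracy of each expert head separately."""
    correct, total = None, 0
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        expert_logits = model(images)            # assumed shape: [num_experts, B, num_classes]
        if correct is None:
            correct = torch.zeros(expert_logits.size(0), device=device)
        preds = expert_logits.argmax(dim=-1)     # [num_experts, B]
        correct += (preds == targets.unsqueeze(0)).sum(dim=1).float()
        total += targets.size(0)
    return (correct / total).tolist()
```

The same idea extends to many-/medium-/few-shot accuracy by masking `targets` per group.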

Lllllp93 commented 2 years ago

TADE settings: 224-pixel input, 200 epochs, batch size 64, ResNet-50 backbone. cls_num_list in the train dataset: [2883, 1019, 56]; cls_num_list in the test dataset: [349, 125, 5].

Lllllp93 commented 2 years ago

It seems that each expert is doing what it's good at, but test-time aggregation doesn't improve the result.

Vanint commented 2 years ago

One issue is Expert 2, whose many-shot performance is 37.63 and medium-shot performance is 100. Why is the overall performance 21?

Vanint commented 2 years ago

Expert 0 and expert 1 have the same overall performance?

In addition, it seems that expert 0 has not been trained well, since it should perform better on many-shot classes than expert 1, from my experience.

Since there are only 3 classes and there is no performance on few-shot classes (although how classes are divided depends on your setting), maybe you can consider reducing the number of experts to 2 and training expert 0 better.

Vanint commented 2 years ago

Before using test-time aggregation, we should first ensure that each expert has been trained well and is good at the corresponding classes.

Lllllp93 commented 2 years ago

One issue is Expert 2, whose many-shot performance is 37.63 and medium-shot performance is 100. Why is the overall performance 21?

I think that is because the number of medium-shot test images is only 5, while the number of many-shot test images is about 480, so the overall performance is mainly determined by the many-shot performance.

Expert 0 and expert 1 have the same overall performance? In addition, it seems that expert 0 has not been trained well, since it should perform better on many-shot classes than expert 1, from my experience.

Below is the performance of the last epoch (the results above are from model_best.pth); expert 0 performs better on many-shot classes than expert 1.

Since there are only 3 classes and there is no performance on few-shot classes (although how classes are divided depends on your setting), maybe you can consider reducing the number of experts to 2 and training expert 0 better.

I think changing the number of experts may help for my dataset; I will try it.

Before using test-time aggregation, we should first ensure that each expert has been trained well and is good at the corresponding classes.

Agreed. Btw, I found that the training loss is about two times higher than the val_loss, and the training-set accuracy is 20% lower than the val-set accuracy at the end of the training phase. Does that mean the model is not trained well? And did you get similar accuracy and loss for the training set and val set when you got the best performance?

Vanint commented 2 years ago

Below is the performance of the last epoch (the results above are from model_best.pth); expert 0 performs better on many-shot classes than expert 1.

I see. Since your expert 1 has achieved 100% on the third class, you can simply use expert 0 with cross-entropy and expert 1 with balanced softmax. There is no need to use expert 2 with inverse softmax.

With only two experts, how do the best model and the last model perform? I am quite interested in the performance of the average ensemble.
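A rough sketch of that two-expert setup and the average ensemble (illustrative names only, not the repo's exact code; `prior` would be the training class frequency):

```python
import torch
import torch.nn.functional as F

def two_expert_loss(logits_e0, logits_e1, target, prior):
    # Expert 0: plain cross-entropy; expert 1: balanced softmax.
    loss_ce = F.cross_entropy(logits_e0, target)
    loss_bal = F.cross_entropy(logits_e1 + torch.log(prior + 1e-9), target)
    return loss_ce + loss_bal

def average_ensemble(logits_e0, logits_e1):
    # Simple average ensemble at test time, before any test-time aggregation.
    return (F.softmax(logits_e0, dim=1) + F.softmax(logits_e1, dim=1)) / 2
```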

I found that the training loss is about two times higher than the val_loss, and the training-set accuracy is 20% lower than the val-set accuracy at the end of the training phase. Does that mean the model is not trained well? And did you get similar accuracy and loss for the training set and val set when you got the best performance?

I did not see a similar phenomenon of "training acc lower than validation acc" on the long-tailed learning datasets, but I can guess two possible reasons: (1) the model was not trained well; (2) the validation set is too small, leading to higher accuracy. In general, train_acc is typically higher than val_acc.

Lllllp93 commented 2 years ago

With only two experts, how do the best model and the last model perform? I am quite interested in the performance of the average ensemble.

It seems that the results with expert 0 (forward) and expert 1 (uniform) are more reasonable; test-time aggregation makes the accuracy higher. Besides, expert 0 has a greater weight than expert 1, which is consistent with my forward-LT test set.

In addition, the training loss is shown below. It seems that the training loss didn't decrease after 180 epochs, which may suggest that the model was 'trained well'. So does that mean expert 2 (backward) has a negative impact on my task? If so, maybe removing expert 2 with inverse softmax or reducing the value of tau may help.

[figure: training loss curve (loss_train)]

Vanint commented 2 years ago

In addition, the training loss is shown below. It seems that the training loss didn't decrease after 180 epochs, which may suggest that the model was 'trained well'. So does that mean expert 2 (backward) has a negative impact on my task? If so, maybe removing expert 2 with inverse softmax or reducing the value of tau may help.

I think so. Since the number of classes is very small, it is not necessary to construct expert 2.