bupt-ai-cz / PGDF

Sample Prior Guided Robust Model Learning to Suppress Noisy Labels

questions about performances #4

Closed nikokks closed 4 months ago

nikokks commented 1 year ago

Hello Guys,

First of all congratulations for your work! :)

I have a question. I was looking at the performance you report on CIFAR-10:

Have you tried the following at an 80% noise ratio: 1) run the script to get your benchmark as usual, 2) save the whole dataset with the corrected labels, 3) restart the script on this modified dataset?

I haven't read your whole paper yet, so you may already have done this. If not, do you think the performance would increase even more? As far as I can see, with 80% label noise you end up at 82.5% accuracy, so the remaining problem looks close to a 20% noise ratio. I know the problem is harder than that, so my question is: does iterating improve performance even further? Each iteration would start from a fresh initialization of the model weights.
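Roughly what I have in mind (just a sketch; the helper names are placeholders, not your actual scripts):

# Iterate the whole pipeline, re-labelling the dataset between runs.
# load_noisy_cifar10, build_model, train_pgdf and predict_labels are hypothetical helpers.
dataset = load_noisy_cifar10(noise_ratio=0.8)

for it in range(num_iterations):
    model = build_model()                               # fresh weights at every iteration
    model = train_pgdf(model, dataset)                  # run the usual pipeline
    dataset.targets = predict_labels(model, dataset)    # overwrite labels with cleaned predictions
    # the next iteration starts from a dataset that should contain less noise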

Thanks a lot

In any case, congratulations! Your paper is superb and your code very clear.

buptcwk commented 1 year ago

Thanks for your question!

Iterating on the dataset is really a good idea. Although we did not try this with PGDF, our previous work [1] did something similar based on the same intuition; see Fig. 9 in [1]. In that work the experiment is set at a low noise ratio (20%), and after the first pass over the original dataset the noise ratio drops to a very low value (<1%), so further iterations bring very little performance gain. But I think it may work in heavy-noise scenarios. Our team's future research may look into this.

Reference: [1] Zhu, Chuang, et al. "Hard sample aware noise robust learning for histopathology image classification." IEEE Transactions on Medical Imaging 41.4 (2021): 881-894.

nikokks commented 1 year ago

Hi ;) Thanks for your well-documented reply!

I think switching to a different architecture at each iteration (ResNet, ViT, MobileNet, CLIP, etc.) could help too. Each model has its own "perception" of the images, and the more the architectures differ, the more their perceptions differ.

Another suggestion is to replace the SGD optimizer with Adam or AdaBelief (faster and possibly better convergence).
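For example (assuming the standard torch.optim API; AdaBelief comes from the third-party adabelief-pytorch package):

import torch.optim as optim

# instead of the existing SGD optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
# or, with the adabelief-pytorch package installed:
# from adabelief_pytorch import AdaBelief
# optimizer = AdaBelief(model.parameters(), lr=1e-3)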

Would you be interested in me working on it with you?

nikokks commented 1 year ago

Moreover, I think using early stopping could improve warmup convergence. As far as I can see, you fix a variable that sets the number of warmup steps. It might be better to take the checkpoint that performs best on the validation set into the second training stage; that way you would have a "soft" parameter instead of a hard one.
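Something like this (a sketch only; warmup_one_epoch and evaluate_acc are placeholders for your own warmup loop and validation pass):

import copy

best_acc, best_state, patience, bad_epochs = 0.0, None, 5, 0

for epoch in range(max_warmup_epochs):
    warmup_one_epoch(model, optimizer, train_loader)   # placeholder: one warmup epoch
    acc = evaluate_acc(model, val_loader)              # placeholder: accuracy on the val set
    if acc > best_acc:
        best_acc = acc
        best_state = copy.deepcopy(model.state_dict())
        bad_epochs = 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:                         # early stopping
        break

model.load_state_dict(best_state)   # continue the second stage from the best checkpoint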

nikokks commented 1 year ago

Another thing: I think using the prediction entropy as a confidence measure (relative to a threshold) could be a better way to decide whether a sample's label is noisy or not, i.e., a filter based on the model's own ability to make confident predictions. It should remove some hard hyperparameters too 😉 and give better results.
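A minimal sketch of the entropy filter I mean (assuming probs holds the model's softmax outputs for the training set):

import math
import torch

# probs: (N, C) softmax outputs over the training set
entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=1)
confidence = 1.0 - entropy / math.log(probs.size(1))   # 1 = fully confident, 0 = uniform
clean_mask = confidence > confidence.mean()            # data-driven cut instead of a fixed threshold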

buptcwk commented 1 year ago

Hi,

Your ideas are very interesting and impressive! Thanks a lot for your reply and invitation. However, I will graduate and start working in a company next month, so I may not have enough time to work on it in the future. Thanks again for your kindness and wish you success in your research 😊.

nikokks commented 1 year ago

Another tip: as far as I can see, your competitors used larger models, so maybe doing the same would improve your results.

I have a question: when you talk about CIFAR-10 sym-90, do you mean that only 10% of the samples are correctly labeled and 90% get a random label from the 10 classes? If so, I imagine your work could label almost any image-classification dataset without any labeled data!

So maybe trying to handle the problem without any labeled data could be a good thing to test. If you are confident about this, the only remaining problem should be matching the model's output classes to the true classes using the validation set. If that does not work, some self-supervised learning, as in one of the recent papers, should help.

I think opening your future work to text classification and tabular data could make some noise in the field.

buptcwk commented 1 year ago

The answer is yes. But when the noise ratio is very high, the performance becomes unstable; this is a common issue in many LNL algorithms. Work [1] reported that pretraining the model weights with contrastive learning can bring a significant performance gain. You could also try that.

Reference:

[1] Zheltonozhskii, Evgenii, et al. "Contrast to divide: Self-supervised pre-training for learning with noisy labels." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022.

nikokks commented 1 year ago

I would try one more thing: instead of using the same model twice in the iteration, use two different architectures, and during warmup wait until the two models have converged independently.

nikokks commented 1 year ago

Another thing (sorry for disturbing you): it might be better to quantify the overall confidence across all classes rather than just using the best prediction for prob_his1 and prob_his2, as in

prob1 = m*prob1_gmm + (1-m)*prob_his1
prob2 = m*prob2_gmm + (1-m)*prob_his2

prob_gmm is a membership probability. Like NegEntropy (or something else), I think it would be better not to use prob_his1 directly but a probability P from a formula like log(P) = log(p_i) + sum_{j != i} log(1 - p_j), which is like a binary cross-entropy; or maybe use the negative entropy directly and convert it to a scalar P, i.e., go from sum_i p_i log(p_i) to P.
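As a sketch (p is the (N, C) softmax output and y the observed labels; either scalar score below could replace prob_his in the mixing above):

import torch

eps = 1e-12
p_y = p.gather(1, y.view(-1, 1)).squeeze(1)           # p_i for the observed label
log_one_minus = torch.log((1 - p).clamp_min(eps))     # log(1 - p_j) for every class
log_P = torch.log(p_y.clamp_min(eps)) + log_one_minus.sum(dim=1) - torch.log((1 - p_y).clamp_min(eps))
P_bce = torch.exp(log_P)                              # log(p_i) + sum_{j != i} log(1 - p_j)

neg_entropy = (p * torch.log(p.clamp_min(eps))).sum(dim=1)   # sum_i p_i log(p_i), <= 0
P_ent = torch.exp(neg_entropy)                               # maps it to a scalar in (0, 1]

# then e.g. prob1 = m * prob1_gmm + (1 - m) * P_bce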

nikokks commented 1 year ago

For args.md, maybe try:

nikokks commented 1 year ago

On the lines pred1 = prob1 > 0.5 and pred2 = prob2 > 0.5, I think determining the threshold that maximizes the F1 score (or precision, or recall, your choice) on the validation (or training) set could improve on the fixed value of 0.5. Improving this threshold should improve convergence and reduce the number of epochs.
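A quick way to do it (a sketch with scikit-learn, assuming is_clean marks which labels of a validation split are actually clean):

import numpy as np
from sklearn.metrics import f1_score

# prob: per-sample clean probabilities on the validation split, is_clean: 0/1 ground truth
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(is_clean, prob > t) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]

pred1 = prob1 > best_t   # instead of the fixed 0.5
pred2 = prob2 > best_t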

nikokks commented 1 year ago

It would be better not to use this hard-coded hyperparameter:

lr = args.learning_rate
if epoch >= args.lr_switch_epoch:
    lr /= 10

but rather an LR scheduler such as lr_scheduler.LinearLR. It can deliver the same results without an extra hyperparameter, or even better results. See https://pytorch.org/docs/stable/optim.html. Maybe looking at the state of the art in optimizers for classification, combined with an LR scheduler, would also improve your results (faster and better convergence).
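For example, with the built-in schedulers (train_one_epoch is a placeholder for the existing training loop):

import torch.optim as optim
from torch.optim import lr_scheduler

optimizer = optim.SGD(model.parameters(), lr=args.learning_rate)
# decay the LR linearly over training instead of one hard /10 switch
scheduler = lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=num_epochs)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # placeholder
    scheduler.step()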

nikokks commented 1 year ago

In the end, the fewer hyperparameters you have, the more stable your results will be. If you want some help in the coming months, I can put together a state-of-the-art review in the meantime to help you with the code :) It would be a pleasure to participate!!

buptcwk commented 1 year ago

Thanks again for your helpful advice! 👍 I hope your research goes well!