SPR identify noisy data

Yikai-Wang / SPR-LNL

This is the official repo for our CVPR22 paper: Scalable Penalized Regression for Noise Detection in Learning With Noisy Labels.


SPR identify noisy data #1

Closed macaulishxcoo closed 1 year ago

macaulishxcoo commented 1 year ago

Hello author, I trained on my own data using the method in your paper and output all the samples flagged as noisy. When I analyzed them, some of the flags were not very accurate. I did achieve high accuracy with the method, but only by removing too much data. I would like to relax the criterion for identifying noisy data, even though that may reduce accuracy. What should I do? Thanks.

Yikai-Wang commented 1 year ago

Hello there,

I'm not entirely sure I fully grasp the problem, but it seems that you want to ensure that most of the samples identified as noisy are truly noisy.

To achieve this, a simple trick is to estimate the noise ratio of your training set and set the expected clean ratio of the data accordingly, rather than keeping the value of 50% that our implementation uses by default. This should help reduce the biased selection caused by a fixed selection ratio.
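As a rough illustration of choosing the clean ratio from an estimated noise ratio, here is a minimal sketch. It is not the code from this repository: `select_clean_indices`, `noise_scores`, and `estimated_noise_ratio` are hypothetical names, and the scores stand in for whatever per-sample noise measure your pipeline produces (higher meaning more likely noisy).

```python
def select_clean_indices(noise_scores, clean_ratio=0.5):
    """Keep the `clean_ratio` fraction of samples with the lowest noise scores.

    Hypothetical helper: `noise_scores` holds one score per training sample,
    where a higher score means the sample is more likely to be mislabeled.
    """
    if not 0.0 < clean_ratio <= 1.0:
        raise ValueError("clean_ratio must be in (0, 1]")
    n_keep = max(1, round(clean_ratio * len(noise_scores)))
    # Rank samples from least to most suspicious, then keep the first n_keep.
    order = sorted(range(len(noise_scores)), key=lambda i: noise_scores[i])
    return sorted(order[:n_keep])


# Assumed estimate: about 20% of the labels are wrong, so keep about 80%.
estimated_noise_ratio = 0.2
scores = [0.1, 0.9, 0.3, 0.05, 0.7, 0.2, 0.4, 0.6, 0.15, 0.8]
kept = select_clean_indices(scores, clean_ratio=1.0 - estimated_noise_ratio)
print(kept)
```

With a fixed ratio of 0.5, half of the samples are discarded regardless of how noisy the data really is; tying the ratio to an estimate keeps more data when the noise level is low.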

In our recent work, Knockoffs-SPR, we have systematically studied how to control the false-selection rate (FSR) when selecting a clean subset from the training data. This method guarantees the smallest possible FSR and might be helpful for your problem.

While this recent work is currently undergoing review for potential publication, we plan to release the code once it's ready for publication. Additionally, our manuscript is available on arXiv, where you might find some useful suggestions.

I hope this helps.

macaulishxcoo commented 1 year ago

Thanks for your reply!!!

I did encounter a problem where some clean data were mistakenly identified as noisy, so I want to increase the proportion of data recognized as clean. How should I adjust the clean ratio? The SPR method in the code is a bit difficult to understand.

Yikai-Wang commented 1 year ago

https://github.com/Yikai-Wang/SPR-LNL/blob/af29e431bf7bd4d7840ce87c08b2ae9ebafeb7d1/models/spr.py#L85-L86

Replace the value 0.5 with the desired ratio of your choice.

macaulishxcoo commented 1 year ago

Thank you for your reply!

In practical experiments, I trained on some industrial defect data using the method in the paper. These data have the following characteristics: 1. The class distribution is uneven: some classes have only a few hundred samples, while others have over 10,000. 2. The data within the same category vary greatly and could in practice be divided into several categories.

For this type of data, training with the SPR method was very slow. I therefore tried a MobileNetV3 network, but it did not help and training was still very slow. With small sample data the results were decent, but in this extreme data situation something seems to be wrong, and I have no idea what.

Yikai-Wang commented 1 year ago

It seems that the main concern in your application is the class imbalance in the training data. Methods designed specifically for imbalanced data can address this issue.

When applying SPR, we divide the dataset into splits. To tackle the imbalance among different classes in the training data, the data from the smaller classes are used repeatedly in different splits. As a result, the outcomes obtained from SPR might be unduly affected if the smaller classes are very small. Note that I haven't carried out experiments in scenarios with significant imbalance, so I cannot confirm that this is the underlying cause.
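To make the reuse of small classes concrete, here is a small sketch of one way class-balanced splits could be built. This is an assumption for illustration only, not the splitting code used in SPR: `balanced_splits` and `per_class` are hypothetical names, and the wrap-around rule is just one simple choice.

```python
import random
from collections import Counter, defaultdict


def balanced_splits(labels, per_class, seed=0):
    """Build splits that each take at most `per_class` samples from every class.

    Hypothetical sketch: large classes are partitioned across splits, while
    classes smaller than needed wrap around and are reused in several splits.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)

    # Enough splits to cover the largest class once (ceiling division).
    n_splits = max(-(-len(v) // per_class) for v in by_class.values())

    splits = []
    for s in range(n_splits):
        split = []
        for indices in by_class.values():
            take = min(per_class, len(indices))
            start = s * per_class
            # Modulo indexing makes small classes repeat across splits.
            split.extend(indices[(start + k) % len(indices)] for k in range(take))
        splits.append(split)
    return splits


# Toy imbalance: 1000 samples of class 0 and only 50 samples of class 1.
labels = [0] * 1000 + [1] * 50
splits = balanced_splits(labels, per_class=100)
usage = Counter(idx for split in splits for idx in split)
print(len(splits), usage[0], usage[1000])
```

In this toy setting every sample of the small class appears in all splits, while each sample of the large class appears only once, which shows how a very small class can dominate what each split sees of that class.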