Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

Is there an issue with how the number of training data boxes is allocated per class? #1672

Closed showfaker66 closed 8 months ago

showfaker66 commented 10 months ago

📚 Documentation Improvement

Thank you for your work! It helps me a lot. Recently, I have been training a model on student exam papers with handwriting and printing categories, each with more than 10 million instances. When defining the categories, 1 is handwriting and 2 is printing. During training, handwriting is learned first, and printing only starts being recognized after handwriting has been learned. So I am wondering whether there is a quantity-distribution threshold, with classes inside that threshold being trained first. Thank you for your answers! Have a nice day!

BloodAxe commented 10 months ago

Sorry, I could not follow what issue you are facing. Would you mind explaining it more clearly?

showfaker66 commented 10 months ago

OK. Thank you for your reply! I am training on four categories from student test papers: handwriting, printing, drawn circles, and underlines. Their instance counts are 14 million, 7 million, 250,000, and 200,000 respectively. During training, the AP50 of the printing category stayed at 0 on the validation set, while the other three categories were learned fairly well. I trained for 40 epochs, and it wasn't until the 8th epoch that the printing category started being learned. Given the large amount of data, could there be a situation where the model applies a top_k threshold to the training data during training?

BloodAxe commented 8 months ago

I think what you want is some sort of data sampling at the dataset/dataloader level to account for the huge class imbalance that you have.

Out of the box, SG doesn't have anything that fits your needs, but you can probably write your own sampler:

1) The idea is to build a presence matrix A of shape [Num Samples, Num Classes] which holds the number of objects of each class in each sample of the dataset. The easiest way to get this matrix is to go over the dataset and count the objects in each sample.
2) From this matrix you can compute a weight for each class.
3) Multiply the computed class-weight vector with the presence matrix to get a weighted sampling score for each row.
4) Lastly, use this weighting score in a WeightedRandomSampler:

import numpy as np
from torch.utils.data import DataLoader, WeightedRandomSampler

# "Balanced" class weights: total objects / (num classes * per-class count),
# the same formula sklearn's compute_class_weight("balanced", ...) uses.
class_counts = presence_matrix.sum(axis=0)  # [Num Classes]
class_weights = presence_matrix.sum() / (len(class_counts) * class_counts)

# Per-sample sampling score: samples containing rare classes get a larger weight.
weights = (class_weights[None, :] * presence_matrix).sum(axis=1)  # [Num Samples]

sampler = WeightedRandomSampler(weights=weights, num_samples=num_samples, replacement=True)

train_data_loader = DataLoader(train_dataset, batch_size=..., shuffle=False, sampler=sampler)
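
And for step 1, a minimal sketch of how presence_matrix could be built, assuming each dataset item returns (image, targets) with one row per object and the class index in the first column (the column position and num_classes=4 are assumptions, adapt them to your dataset's annotation format):

import numpy as np

num_classes = 4  # handwriting, printing, circles, underlines
presence_matrix = np.zeros((len(train_dataset), num_classes), dtype=np.int64)

for i in range(len(train_dataset)):
    _, targets = train_dataset[i]                       # one row per labeled object
    class_ids = np.asarray(targets)[:, 0].astype(int)   # assumed: class index in column 0
    presence_matrix[i] = np.bincount(class_ids, minlength=num_classes)

Since __getitem__ usually decodes the image as well, it can be faster to read the annotation files directly; either way the matrix only needs to be computed once and can be cached to disk.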

num_samples in this example is the number of samples you want to use within one epoch. Given the amount of training data you have, you may actually want to use a smaller number here (say 32K images per epoch) to keep training time reasonable.
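
As an optional sanity check, you can draw one epoch of indices from the sampler and confirm that the per-class object counts come out roughly balanced (presence_matrix and sampler are the variables from the snippets above):

sampled_indices = list(iter(sampler))                     # num_samples dataset indices
per_class = presence_matrix[sampled_indices].sum(axis=0)  # objects of each class in one epoch
print("objects per class in one epoch:", per_class)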

Hope this solves your issue.