MontagueM / TBPS_Team8


DISCUSSION: Classification best class sizes #7

Open vidh2000 opened 2 years ago

vidh2000 commented 2 years ago

Training on more samples is better, and training on 750k samples took about an hour, which is not too long. But the classes don't have an equal number of events in total: 3 classes have only ~1000 events, while the others have ~800k events each. I don't think we can just duplicate the ~1000-event classes 800x to get training data of the same size as the other classes. It would teach the algorithm that those decays look EXACTLY like those 1000 examples, purely because we copied them 800x.

Possible solutions:

(1) Resample the small classes to obtain more events. Cons: how do we do it? I guess we could sample if the features followed some known distributions. Pros: if done successfully, it is a good way to obtain more "correct" events corresponding to these classes.

(2) Train on only ~1000 events per class. Cons: I don't know if 1000 events per class is enough for the algorithm. Accuracy was still great using gradient boosting, but it feels stupid, as I have 3 million samples in total available to test/train and I only used 10k... Pros: equal-sized classes, so no bias towards any one class.

Note that accuracy for some non-signal classes was still great when 400k events per class were used. However, there was one class where half of the events were classified correctly and the other half ended up in some other class, which is bad. What to do?
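One possible middle ground between duplicating the small classes and throwing away most of the large ones is sketched below: cap each class at some maximum number of events, then give every remaining event a weight inversely proportional to its class size. This is only a sketch under assumptions. The function name `capped_balanced_sample`, the cap of 1000 and the toy labels are all made up for illustration and are not from our dataset or code. Also note that weighting adds no new information about the rare classes, so it does not remove the need to check them on held-out events.

```python
import numpy as np

def capped_balanced_sample(labels, cap, seed=0):
    """Keep at most `cap` events per class and return balancing weights.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if idx.size > cap:
            # Randomly undersample the large classes (no duplication).
            idx = rng.choice(idx, size=cap, replace=False)
        keep.append(idx)
    keep = np.sort(np.concatenate(keep))

    kept_labels = labels[keep]
    classes, counts = np.unique(kept_labels, return_counts=True)
    # "Balanced" heuristic: n_samples / (n_classes * class_count),
    # so every class contributes the same total weight.
    class_weight = kept_labels.size / (classes.size * counts)
    weights = class_weight[np.searchsorted(classes, kept_labels)]
    return keep, weights

# Toy labels (made up): two large classes and one tiny class.
labels = np.array([0] * 8000 + [1] * 8000 + [2] * 10)
keep, weights = capped_balanced_sample(labels, cap=1000)
per_class_total = [weights[labels[keep] == c].sum() for c in (0, 1, 2)]
```

The `weights` array could then be passed as `sample_weight` to the classifier's `fit` call if the library supports it (scikit-learn's gradient boosting classifiers do), with `keep` used to select the matching rows of the feature table.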

MontagueM commented 2 years ago

Yo, I think it's best to change the direction of the ML stuff. I talked to Mingsong about it; how would this go for two ideas? 1:

Another thing to note is that visualisations are a key part of all this, so always include confusion matrices, ROC curves with their AUC, and weight bar plots.
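As a rough illustration of the confusion-matrix part, here is a minimal sketch of a row-normalised confusion matrix, which makes it easy to spot a class where a large fraction of events ends up in another class. The function name `row_normalised_confusion` and the toy labels are assumptions for the example, not taken from our classifier.

```python
import numpy as np

def row_normalised_confusion(y_true, y_pred, n_classes):
    """Return (counts, fractions); rows are true classes, columns predictions.

    Hypothetical helper for illustration only.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    # Add 1 to cell (true, predicted) for every event.
    np.add.at(cm, (y_true, y_pred), 1)

    row_totals = cm.sum(axis=1, keepdims=True)
    # Divide each row by the number of true events in that class;
    # classes with no events keep a row of zeros.
    frac = np.divide(
        cm,
        row_totals,
        out=np.zeros(cm.shape, dtype=float),
        where=row_totals > 0,
    )
    return cm, frac

# Toy example (made up): half of class 1 is misclassified as class 0.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0]
cm, frac = row_normalised_confusion(y_true, y_pred, n_classes=2)
```

Here `frac[1]` shows how the true class-1 events are split between the predicted classes. scikit-learn's `sklearn.metrics.confusion_matrix` gives the same counts and might be the more convenient choice in practice.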

vidh2000 commented 2 years ago

I agree, though I'm not really sure how you mean to train now... When could we meet next? I'll try to finalise the classifier today and push it to GitHub. After that I can focus on how and what to classify too.