ASUS-AICS / LibMultiLabel

A library for multi-class and multi-label classification
MIT License

Suggested configuration for training NN models on highly imbalanced datasets #229

Open henrykang7177 opened 1 year ago

henrykang7177 commented 1 year ago

Hi!

I have a binary classification dataset with a highly imbalanced label distribution (pos : neg = 1 : 200).

I tried applying the BERT code from the Neural Network Quickstart Tutorial (https://www.csie.ntu.edu.tw/~cjlin/libmultilabel/api/nn_tutorial.html#neural-network-quickstart-tutorial) directly to this dataset, with the validation metric set to "Macro-F1", but the trained model mostly predicts all negatives.
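For concreteness, my run followed the tutorial's BERT setup; here is a minimal sketch of the config in the style of LibMultiLabel's example YAML configs (the key names are my best reading of those configs, and the paths and hyperparameter values are placeholders rather than my exact settings):

```yaml
# Sketch of a BERT run in the example-config style; paths and values are placeholders.
training_file: data/mydata/train.txt
test_file: data/mydata/test.txt
model_name: BERT
val_metric: Macro-F1                        # model selection on Macro-F1
monitor_metrics: [Macro-F1, Micro-F1, P@1]  # metrics reported during validation
learning_rate: 0.00002
epochs: 15
batch_size: 16
```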

I am wondering whether there are parameters or configurations in LibMultiLabel that I could tune to improve the model's performance on such an imbalanced dataset?
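One general remedy I am aware of, independent of LibMultiLabel (and not an option I can confirm it exposes), is to upweight the positive class in the loss. A minimal PyTorch sketch of the idea, with a heuristic weight matching my 1:200 ratio:

```python
import torch
import torch.nn as nn

# General class-reweighting idea (not a confirmed LibMultiLabel option):
# scale the loss on positive examples by the inverse class ratio, so the
# rare class is not drowned out. With pos:neg = 1:200, start near 200.
pos_weight = torch.tensor([200.0])  # heuristic; tune against validation Macro-F1
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                     # stand-in model outputs (batch of 8)
targets = torch.randint(0, 2, (8, 1)).float()  # stand-in binary labels
loss = criterion(logits, targets)              # each positive now weighs ~200x
```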

For your reference:

I also tried the linear method, where switching from train_1vsrest to train_cost_sensitive noticeably improved this issue: with train_cost_sensitive, the model predicts about 4 times as many positive samples as with train_1vsrest. Both methods reach "Micro-F1" and "P@1" close to 0.99 (because negative samples dominate), but only around 0.5 Macro-F1.
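Roughly, that comparison ran along these lines (a sketch assuming the linear quickstart API as of the time of writing, which may differ in newer LibMultiLabel versions; the paths are placeholders and '' means default LIBLINEAR options):

```python
import libmultilabel.linear as linear

# Load and vectorize the data (quickstart-style; paths are placeholders).
preprocessor = linear.Preprocessor(data_format='txt')
datasets = preprocessor.load_data('data/mydata/train.txt', 'data/mydata/test.txt')

# Same data, two training strategies; '' uses default LIBLINEAR options.
model_ovr = linear.train_1vsrest(datasets['train']['y'], datasets['train']['x'], '')
model_cs = linear.train_cost_sensitive(datasets['train']['y'], datasets['train']['x'], '')

# Decision values; an instance is predicted positive when its value > 0.
preds_ovr = linear.predict_values(model_ovr, datasets['test']['x'])
preds_cs = linear.predict_values(model_cs, datasets['test']['x'])

# Counting (preds > 0) per method is how I saw ~4x more positives
# from cost-sensitive training.
print((preds_ovr > 0).sum(), (preds_cs > 0).sum())
```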

Thanks!

cjlin1 commented 1 year ago

Can you provide more details of your experimental settings? For example, the configuration and the experiment log. Thanks.
