Dealing with class imbalance in Deep Learning

Benjamin-Lee / deep-rules

Ten Quick Tips for Deep Learning in Biology

https://benjamin-lee.github.io/deep-rules/

Dealing with class imbalance in Deep Learning #193

Open souravsingh opened 5 years ago

souravsingh commented 5 years ago

Have you checked the list of proposed tips to see if the tip has already been proposed?

[x] Yes

Did you add yourself as a contributor by making a pull request if this is your first contribution?

[ ] Yes, I added myself or am already a contributor

Feel free to elaborate, rant, and/or ramble.

There may be an imbalance in the class distribution, which is quite common in bioinformatics problems. I believe most of the advice for dealing with imbalance in general ML should carry over to deep learning as well:

1) Try rephrasing the problem
2) Obtain more data
3) Tweak weights appropriately for class imbalance
4) Apply regularization techniques
5) Use oversampling or undersampling techniques(?)
6) Use K-fold CV in the correct way
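To make points 3, 5, and 6 in the list above a bit more concrete, here is a minimal, standard-library-only sketch. It is an illustration under stated assumptions, not a recommended pipeline: the function names are hypothetical, the inverse-frequency weighting is just one common heuristic, and "K-fold CV in the correct way" is read here as (a) stratifying folds by class and (b) oversampling only the training folds so that duplicated minority samples never leak into the held-out fold.

```python
import random
from collections import Counter, defaultdict


def inverse_frequency_weights(labels):
    """Point 3: weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def stratified_folds(labels, n_folds, seed=0):
    """Point 6: split sample indices into folds that keep the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def oversample(train_idx, labels, seed=0):
    """Point 5: duplicate minority samples (with replacement) until every
    class matches the largest one. Call this on training indices only."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced


# Toy data (an assumption for illustration): 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)
folds = stratified_folds(labels, n_folds=5)

# Hold out fold 0; resample only the remaining training indices.
test_idx = folds[0]
train_idx = [i for fold in folds[1:] for i in fold]
train_balanced = oversample(train_idx, labels)

print(weights)
print(Counter(labels[i] for i in train_balanced))
```

In a deep learning framework the weights would typically be passed to the loss function as per-class weights. Weighting and oversampling address the same problem, so one would usually pick one of the two rather than stack both.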

Any citations for the rule? (peer-reviewed literature preferred but not required)

agitter commented 5 years ago

I agree that class imbalance is a common issue in biology. How much of the discussion would be specific to deep learning as opposed to general ML? If the solutions are general, we may only mention it briefly instead of making a full tip.

Do the solutions of rephrasing the problem and obtaining more data apply in biology? In settings like genome annotation or chemical bioactivity classification, the domain is inherently dominated by negatives regardless of how much data we acquire.

This topic also fits with the brief sentence we currently have about ROC curves having limited utility for class-imbalanced problems.
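A small worked example of why ROC can be uninformative here: a classifier's ROC operating point (true positive rate, false positive rate) does not depend on class prevalence, but its precision does. The sketch below holds TPR and FPR fixed and varies the fraction of positives; the specific numbers (TPR 0.9, FPR 0.05, and the three prevalence values) are assumptions chosen purely for illustration.

```python
def precision_at(tpr, fpr, prevalence):
    """Precision implied by a fixed ROC operating point at a given
    fraction of positives in the data."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same ROC point throughout; only the class balance changes.
tpr, fpr = 0.9, 0.05
for prevalence in (0.5, 0.1, 0.01):
    p = precision_at(tpr, fpr, prevalence)
    print(f"prevalence={prevalence:<5} precision={p:.3f}")
```

With balanced classes the precision is about 0.95, but at 1% positives it falls to roughly 0.15 even though the ROC curve is unchanged, which is the usual argument for also reporting precision-recall metrics on imbalanced problems.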

rasbt commented 5 years ago

> I agree that class imbalance is a common issue in biology. How much of the discussion would be specific to deep learning as opposed to general ML? If the solutions are general, we may only mention it briefly instead of making a full tip.

Good point. I am not sure how successful this is in general, but I recently stumbled upon a paper in which the researchers used GANs to generate synthetic samples to address the imbalance issue. However, in general, I think DL is neither more prone nor more immune to class imbalance than other ML approaches.

One approach that is more DL-specific, though, is the focal loss, which was first proposed for RetinaNet:

Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980-2988). (https://arxiv.org/abs/1708.02002)
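For readers who have not seen it, the focal loss from the paper above is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The following is a minimal single-example sketch in plain Python rather than any framework's implementation; the function name is hypothetical, and the defaults gamma=2 and alpha=0.25 are the values the paper reports working well in its object-detection experiments, so they are an assumption here and not tuned for biological data.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one example.

    p: predicted probability of the positive class
    y: true label, 0 or 1
    """
    # Probability assigned to the true class, and its class weight.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.95, 1)  # confident and correct
hard = binary_focal_loss(0.10, 1)  # confident and wrong
print(f"easy={easy:.6f} hard={hard:.6f}")
```

Setting gamma=0 recovers alpha-weighted cross-entropy, so the modulating factor is the only new ingredient: it keeps the many easy examples (often the abundant negatives) from dominating the total loss.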

souravsingh commented 5 years ago

I believe obtaining more data points can help for certain problems, such as those in cancer genomics, where a lab could tap into privately generated data to help solve the problem.

souravsingh commented 5 years ago

In line with @rasbt's comment on GANs, I remember reading a paper that used RNNs to generate protein sequences with a certain type of activity. We could mention this as one way to get more data samples.