PatWalters / practicalcheminformatics


Building a multiclass classification model | Practical Cheminformatics #5

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Building a multiclass classification model | Practical Cheminformatics

Data cleaning, adding structures to PubChem data, building a multiclass model, dealing with imbalanced data

https://patwalters.github.io/practicalcheminformatics/jupyter/multiclass/pubchem/imbalanced/2021/08/28/multiclass-classification.html

UnixJunkie commented 3 years ago

I personally prefer bagging with balanced bootstraps over oversampling. But apart from that, cool post.
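For readers unfamiliar with the idea, here is a minimal sketch of the sampling step behind bagging with balanced bootstraps. It assumes each ensemble member is trained on a bootstrap in which every class is resampled, with replacement, down to the size of the smallest class. The function name `balanced_bootstrap`, the class names other than "activator", and the label counts are hypothetical and chosen only for illustration; they are not taken from the post's dataset.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(labels, rng):
    """Return indices for one bootstrap sample with equal counts per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Resample every class (with replacement) to the minority-class size.
    n_per_class = min(len(idxs) for idxs in by_class.values())
    sample = []
    for idxs in by_class.values():
        sample.extend(rng.choices(idxs, k=n_per_class))
    return sample


# Hypothetical, made-up label distribution (not the post's data).
labels = ["inactive"] * 900 + ["inhibitor"] * 80 + ["activator"] * 20
rng = random.Random(0)

# One balanced bootstrap per ensemble member; a classifier would be
# trained on each and the predictions combined by voting or averaging.
bootstraps = [balanced_bootstrap(labels, rng) for _ in range(10)]
counts = Counter(labels[i] for i in bootstraps[0])
print(counts)
```

Unlike oversampling, no minority-class example is duplicated beyond what ordinary bootstrapping would do; the majority class is instead seen in small, different slices by each ensemble member.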

UnixJunkie commented 3 years ago

Are you only interested in a classifier for this dataset? I wonder if a good regressor can be trained from it.

PatWalters commented 3 years ago

Thanks for the comments. There's a lot more that I want to do with these datasets; stay tuned.

jhjensen2 commented 3 years ago

The precision for the activator class really took a hit with oversampling. It's true that with the standard approach you hardly get any activator predictions, but when you do, there's a 58% chance that a prediction is correct, compared with 27% with oversampling. Of course, there's a large uncertainty in the 58% due to the small sample size.
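To illustrate how wide that uncertainty can be, here is a rough sketch that computes a Wilson score interval for a precision estimate. The counts are hypothetical: the thread does not give the number of activator predictions, so 7 correct out of 12 (about 58%) is assumed purely for illustration, and `wilson_interval` is a helper name made up for this example.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 7 correct out of 12 activator predictions (~58%).
low, high = wilson_interval(7, 12)
print(f"small sample:  {low:.2f} to {high:.2f}")

# Same precision with a hypothetical larger sample, for comparison.
low_big, high_big = wilson_interval(58, 100)
print(f"larger sample: {low_big:.2f} to {high_big:.2f}")
```

With only a dozen predictions, the interval spans a large part of the range between 0 and 1, so the gap between 58% and 27% should be judged against the actual counts in the confusion matrix.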

PatWalters commented 3 years ago

Good point, I should have gone into the stats a bit more. I'm going to revise the post to include an assessment of the impact on precision and recall.

iwatobipen commented 3 years ago

Hi Pat, thanks for the great post! I always get lots of useful information from your posts and code ;) For tackling imbalanced data, I think it's worth checking out Greg's presentation and blog post: https://www.slideshare.net/GregLandrum1/building-useful-models-for-imbalanced-datasets-without-resampling-166150891 http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html

In real drug discovery projects, we often have imbalanced data, so this is really useful. Thanks!

PatWalters commented 3 years ago

Thanks, Taka! Imbalanced data is an important topic and I plan to talk about it more in future posts. As I mentioned in my reply to Jan, I also need to dig more deeply into the stats.

iwatobipen commented 3 years ago

That sounds nice!!!!!!!