8080labs / ppscore

Predictive Power Score (PPS) in Python
MIT License

Assessing performance for different learning algorithms #2

Closed Benfeitas closed 4 years ago

Benfeitas commented 4 years ago

Hi! I'm wondering if you've compiled your performance tests on different learning algorithms? I'd like to see what you found, in addition to what you mention in the readme.

Thanks a lot in advance

8080labs commented 4 years ago

Hi Rui, thank you for reaching out. We have not compiled further performance tests yet; we went with a basic but reliable algorithm.

There are many papers and articles out there comparing the performance of machine learning algorithms on various datasets. I guess there are 3 takeaways:

Since the intention of the PPS is to get a quick understanding of some hidden patterns, I don't think it is critical to always choose the perfect learning algorithm, as long as there are no massive differences. For example, a massive difference would be a score of 0.05 versus 0.5 or 0.9.

Nevertheless, I would be happy to provide more performance comparisons and will share them on the repo. Maybe it will turn out that the DecisionTree should not be the standard algorithm going forward.

If you make some further tests on your data, I am curious to hear about the results!

Best, Florian

Benfeitas commented 4 years ago

Hi Florian,

Thanks so much for your very quick response!

I was thinking precisely about how these might do if we have thousands of features in an n >> p setting. In this case, just as when computing pairwise correlations, there will be a lot of false positives. So in addition to capturing non-linear relationships, as you highlight so well in your Towards Data Science post, I was wondering whether leveraging the asymmetrical property of the PPS would be helpful or would actually make life harder, since it doubles the computational cost, and whether other strategies (including bagging in RF) could help boost F1 scores.
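To put rough numbers on the cost of the asymmetry, here is a minimal back-of-the-envelope sketch. It only counts score computations and does not call ppscore. The helper names, the feature count, and the 1% chance-hit rate are hypothetical values chosen for illustration, not measurements.

```python
def n_symmetric_pairs(p: int) -> int:
    """Unordered feature pairs, as for a symmetric correlation matrix."""
    return p * (p - 1) // 2


def n_directed_pairs(p: int) -> int:
    """Ordered (x -> y) pairs, as needed for an asymmetric score like the PPS."""
    return p * (p - 1)


def expected_spurious_hits(n_tests: int, spurious_rate: float) -> float:
    """Expected chance hits if each test independently passes with spurious_rate."""
    return n_tests * spurious_rate


p = 2000  # hypothetical number of features
directed = n_directed_pairs(p)

print(n_symmetric_pairs(p))  # 1999000
print(directed)              # 3998000
# Assumed 1% chance-hit rate per test; roughly 40 thousand spurious hits.
print(expected_spurious_hits(directed, 0.01))
```

With a couple of thousand features the directed comparison already runs into the millions, and even a small per-test error rate leaves tens of thousands of candidate relationships to sift through.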

I'm aware that this is exactly the situation, where you can easily test millions of associations, in which you want to avoid RF, as it can become too slow. Of course, one obvious solution is to do some preliminary feature selection and only compute the PPS on the reduced feature set :)
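As a rough illustration of the preselection idea, the sketch below filters columns before enumerating the directed pairs that would then be scored. The variance filter is only one possible choice and is my assumption, not something proposed in this thread. The `select_features` helper, the threshold, and the toy data are all hypothetical, and the actual scoring step is left out.

```python
from itertools import permutations
from statistics import pvariance


def select_features(columns: dict[str, list[float]], min_variance: float) -> list[str]:
    """Keep only columns whose population variance reaches min_variance."""
    return [name for name, values in columns.items() if pvariance(values) >= min_variance]


# Toy data: "b" is constant and "d" nearly so, so both carry little or no information.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [5.0, 5.0, 5.0, 5.0],
    "c": [0.0, 1.0, 0.0, 1.0],
    "d": [2.0, 2.0, 2.0, 2.01],
}

kept = select_features(columns, min_variance=0.1)
pairs_to_score = list(permutations(kept, 2))  # directed (x, y) pairs left to score

print(kept)            # ['a', 'c']
print(pairs_to_score)  # [('a', 'c'), ('c', 'a')]
```

Here the filter cuts the work from 12 directed pairs to 2. On real data the threshold and the choice of filter would need to be validated, since an aggressive filter can also drop informative features.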

I'll keep an eye on your posts here (and if you post these results on TDS, I'd really appreciate it if you could link them here). I will close this issue for now in any case.

Thanks once again!