EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

List of ML algorithms used + can new ML algorithms be added with ease? #1117

Closed apavlo89 closed 4 years ago

apavlo89 commented 4 years ago

Hello everyone,

I am a bit of a beginner when it comes to ML. I can't seem to find the list of ML algorithms that TPOT uses. Can anyone list them please? Does it use catboost for example? If not, is there an easy way to add ML algorithms to test?

Regards,

Achilleas

apavlo89 commented 4 years ago

From what I see, it doesn't use CatBoost. It would also be great if Microsoft's LightGBM were added to TPOT.

https://towardsdatascience.com/boosting-showdown-scikit-learn-vs-xgboost-vs-lightgbm-vs-catboost-in-sentiment-classification-f7c7f46fd956
https://lightgbm.readthedocs.io/en/latest/Python-Intro.html

1) Either way, is it easy to add new algorithms for TPOT to test? If yes, could you point me to where/how to add them?
2) Could I also have a complete list of the ML algorithms TPOT tests (preprocessing, classification/regression, etc.)?

Thank you for your help in the matter

apavlo89 commented 4 years ago

Also another question: is the cross-validation done on the selected training dataset or the whole dataset?

weixuanfu commented 4 years ago

See the docs:

- Built-in TPOT configurations
- Customizing TPOT's operators and parameters

CV is performed on the dataset that is assigned to the fit() function.
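As a concrete illustration of the "customizing operators" docs above, TPOT accepts a custom configuration dictionary mapping estimator import paths to hyperparameter search lists. A minimal sketch that adds LightGBM (it assumes the `lightgbm` package is installed; the hyperparameter values below are illustrative choices, not TPOT's defaults):

```python
# A minimal sketch of a custom TPOT configuration dictionary.
# Keys are import paths of estimators; values map hyperparameter
# names to the lists of values TPOT's GP search may choose from.
custom_config = {
    "lightgbm.LGBMClassifier": {          # requires the lightgbm package
        "n_estimators": [100, 250, 500],
        "learning_rate": [0.01, 0.05, 0.1],
        "num_leaves": [15, 31, 63],
    },
    "sklearn.ensemble.RandomForestClassifier": {
        "n_estimators": [100],
        "max_features": ["sqrt", "log2"],
    },
}

# Usage sketch (requires tpot, and lightgbm for the entry above):
# from tpot import TPOTClassifier
# tpot = TPOTClassifier(config_dict=custom_config, generations=5,
#                       population_size=20, cv=5, random_state=42)
# tpot.fit(X_train, y_train)
```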

apavlo89 commented 4 years ago

Thank you for your help, that makes sense. Am I right to think that a good, representative accuracy on a SMALL dataset should show up both as a high k-fold CV score (e.g., 5-fold) on the test set and on the whole dataset (X and y)? Is this logic sound, or should I only look at the k-fold CV score on the training set? Some of my runs with TPOT give me 100% accuracy with 5-fold cross-validation on the training set, but when I run 5-fold CV on the whole dataset I get 60% accuracy. Other runs give a lower 5-fold CV accuracy on the test set (e.g., 93%) but a much higher 5-fold CV accuracy on the whole set (85%). I am inclined to trust the pipeline with a lower CV score on the test set if it also gives a higher CV score on the whole dataset. Is this reasoning correct?

Just for some context: my dataset has 20 samples with 132 features, and it is a binary classification problem (EEG data from a sleep study) trained on 0.80 of the data. I wish I could collect more samples, but corona is making it impossible!
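For context on why a perfect CV score on so few samples can be misleading, here is a small numpy-only sketch (not TPOT code): with 132 features and only 20 samples, even a plain linear model can fit completely random labels on the training data, so 100% on the training side is weak evidence by itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 132                      # matches the dataset described above
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)      # labels are pure noise by construction

# With far more features than samples, the least-squares system is
# underdetermined, so it can interpolate any labelling of the points.
w, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)
train_acc = np.mean((X @ w > 0) == y)
print(train_acc)                    # 1.0: a perfect fit on noise
```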

By the way, I've been using PPS (Predictive Power Score) on my data first to reduce the number of features, and then I apply TPOT; I get much higher scores (+5-10% accuracy) than running TPOT alone.

https://github.com/8080labs/ppscore
https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598

It would be amazing if this were added to TPOT. Maybe try it on your own data to see whether it improves accuracy. I would be very curious to see.

apavlo89 commented 4 years ago

Actually, I just read that for small datasets like mine it is best to do LOOCV instead of k-fold CV. I am about to complete a 1000-generation, 100-offspring LOOCV run and will post the results.
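For illustration, leave-one-out CV simply holds out each sample in turn and trains on the remaining n-1. A numpy-only sketch with a 1-nearest-neighbour classifier on toy data (not the EEG set); with TPOT itself, the `cv` parameter also accepts a scikit-learn splitter, so something like `TPOTClassifier(cv=LeaveOneOut())` should work.

```python
import numpy as np

def loocv_accuracy_1nn(X, y):
    """Leave-one-out CV with a 1-nearest-neighbour classifier:
    each of the n samples is predicted from the other n-1."""
    correct = 0
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the held-out sample
        correct += y[int(np.argmin(d))] == y[i]
    return correct / len(y)

rng = np.random.default_rng(2)
# Two well-separated Gaussian clusters, 10 samples each.
X = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(4, 1, (10, 5))])
y = np.array([0] * 10 + [1] * 10)
print(loocv_accuracy_1nn(X, y))            # separable classes -> high accuracy
```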